Learning Data Science: Data Wrangling, Exploration, Visualization, and Modeling with Python

Name: Learning Data Science: Data Wrangling, Exploration, Visualization, and Modeling with Python
Author: Gonzalez Joseph, Lau Sam, Nolan Deborah (Deb)

Автор: Gonzalez Joseph, Lau Sam, Nolan Deborah (Deb)

Дата выхода: 2023

Издательство: O’Reilly Media, Inc.

Количество страниц: 597

Размер файла: 5,6 МБ

Тип файла: PDF

Добавил: codelibs

Проверить на вирусы

Cover

Table of Contents

Preface

Expected Background Knowledge

Organization of the Book

Conventions Used in This Book

Using Code Examples

O’Reilly Online Learning

How to Contact Us

Acknowledgments

Part I. The Data Science Lifecycle

Chapter 1. The Data Science Lifecycle

The Stages of the Lifecycle

Examples of the Lifecycle

Summary

Chapter 2. Questions and Data Scope

Big Data and New Opportunities

Example: Google Flu Trends

Target Population, Access Frame, and Sample

Example: What Makes Members of an Online Community Active?

Example: Who Will Win the Election?

Example: How Do Environmental Hazards Relate to an Individual’s Health?

Instruments and Protocols

Measuring Natural Phenomena

Example: What Is the Level of CO2 in the Air?

Accuracy

Types of Bias

Types of Variation

Summary

Chapter 3. Simulation and Data Design

The Urn Model

Sampling Designs

Sampling Distribution of a Statistic

Simulating the Sampling Distribution

Simulation with the Hypergeometric Distribution

Example: Simulating Election Poll Bias and Variance

The Pennsylvania Urn Model

An Urn Model with Bias

Conducting Larger Polls

Example: Simulating a Randomized Trial for a Vaccine

Scope

The Urn Model for Random Assignment

Example: Measuring Air Quality

Summary

Chapter 4. Modeling with Summary Statistics

The Constant Model

Minimizing Loss

Mean Absolute Error

Mean Squared Error

Choosing Loss Functions

Summary

Chapter 5. Case Study: Why Is My Bus Always Late?

Question and Scope

Data Wrangling

Exploring Bus Times

Modeling Wait Times

Summary

Part II. Rectangular Data

Chapter 6. Working with Dataframes Using pandas

Subsetting

Data Scope and Question

Dataframes and Indices

Slicing

Filtering Rows

Example: How Recently Has Luna Become a Popular Name?

Aggregating

Basic Group-Aggregate

Grouping on Multiple Columns

Custom Aggregation Functions

Pivoting

Joining

Inner Joins

Left, Right, and Outer Joins

Example: Popularity of NYT Name Categories

Transforming

Apply

Example: Popularity of “L” Names

The Price of Apply

How Are Dataframes Different from Other Data Representations?

Dataframes and Spreadsheets

Dataframes and Matrices

Dataframes and Relations

Summary

Chapter 7. Working with Relations Using SQL

Subsetting

SQL Basics: SELECT and FROM

What’s a Relation?

Slicing

Filtering Rows

Example: How Recently Has Luna Become a Popular Name?

Aggregating

Basic Group-Aggregate Using GROUP BY

Grouping on Multiple Columns

Other Aggregation Functions

Joining

Inner Joins

Left and Right Joins

Example: Popularity of NYT Name Categories

Transforming and Common Table Expressions

SQL Functions

Multistep Queries Using a WITH Clause

Example: Popularity of “L” Names

Summary

Part III. Understanding The Data

Chapter 8. Wrangling Files

Data Source Examples

Drug Abuse Warning Network (DAWN) Survey

San Francisco Restaurant Food Safety

File Formats

Delimited Format

Fixed-Width Format

Hierarchical Formats

Loosely Formatted Text

File Encoding

File Size

The Shell and Command-Line Tools

Table Shape and Granularity

Granularity of Restaurant Inspections and Violations

DAWN Survey Shape and Granularity

Summary

Chapter 9. Wrangling Dataframes

Example: Wrangling CO2 Measurements from the Mauna Loa Observatory

Quality Checks

Addressing Missing Data

Reshaping the Data Table

Quality Checks

Quality Based on Scope

Quality of Measurements and Recorded Values

Quality Across Related Features

Quality for Analysis

Fixing the Data or Not

Missing Values and Records

Transformations and Timestamps

Transforming Timestamps

Piping for Transformations

Modifying Structure

Example: Wrangling Restaurant Safety Violations

Narrowing the Focus

Aggregating Violations

Extracting Information from Violation Descriptions

Summary

Chapter 10. Exploratory Data Analysis

Feature Types

Example: Dog Breeds

Transforming Qualitative Features

The Importance of Feature Types

What to Look For in a Distribution

What to Look For in a Relationship

Two Quantitative Features

One Qualitative and One Quantitative Variable

Two Qualitative Features

Comparisons in Multivariate Settings

Guidelines for Exploration

Example: Sale Prices for Houses

Understanding Price

What Next?

Examining Other Features

Delving Deeper into Relationships

Fixing Location

EDA Discoveries

Summary

Chapter 11. Data Visualization

Choosing Scale to Reveal Structure

Filling the Data Region

Including Zero

Revealing Shape Through Transformations

Banking to Decipher Relationships

Revealing Relationships Through Straightening

Smoothing and Aggregating Data

Smoothing Techniques to Uncover Shape

Smoothing Techniques to Uncover Relationships and Trends

Smoothing Techniques Need Tuning

Reducing Distributions to Quantiles

When Not to Smooth

Facilitating Meaningful Comparisons

Emphasize the Important Difference

Ordering Groups

Avoid Stacking

Selecting a Color Palette

Guidelines for Comparisons in Plots

Incorporating the Data Design

Data Collected Over Time

Observational Studies

Unequal Sampling

Geographic Data

Adding Context

Example: 100m Sprint Times

Creating Plots Using plotly

Figure and Trace Objects

Modifying Layout

Plotting Functions

Annotations

Other Tools for Visualization

matplotlib

Grammar of Graphics

Summary

Chapter 12. Case Study: How Accurate Are Air Quality Measurements?

Question, Design, and Scope

Finding Collocated Sensors

Wrangling the List of AQS Sites

Wrangling the List of PurpleAir Sites

Matching AQS and PurpleAir Sensors

Wrangling and Cleaning AQS Sensor Data

Checking Granularity

Removing Unneeded Columns

Checking the Validity of Dates

Checking the Quality of PM2.5 Measurements

Wrangling PurpleAir Sensor Data

Checking the Granularity

Handling Missing Values

Exploring PurpleAir and AQS Measurements

Creating a Model to Correct PurpleAir Measurements

Summary

Part IV. Other Data Sources

Chapter 13. Working with Text

Examples of Text and Tasks

Convert Text into a Standard Format

Extract a Piece of Text to Create a Feature

Transform Text into Features

Text Analysis

String Manipulation

Converting Text to a Standard Format with Python String Methods

String Methods in pandas

Splitting Strings to Extract Pieces of Text

Regular Expressions

Concatenation of Literals

Quantifiers

Alternation and Grouping to Create Features

Reference Tables

Text Analysis

Summary

Chapter 14. Data Exchange

NetCDF Data

JSON Data

HTTP

REST

XML, HTML, and XPath

Example: Scraping Race Times from Wikipedia

XPath

Example: Accessing Exchange Rates from the ECB

Summary

Part V. Linear Modeling

Chapter 15. Linear Models

Simple Linear Model

Example: A Simple Linear Model for Air Quality

Interpreting Linear Models

Assessing the Fit

Fitting the Simple Linear Model

Multiple Linear Model

Fitting the Multiple Linear Model

Example: Where Is the Land of Opportunity?

Explaining Upward Mobility Using Commute Time

Relating Upward Mobility Using Multiple Variables

Feature Engineering for Numeric Measurements

Feature Engineering for Categorical Measurements

Summary

Chapter 16. Model Selection

Overfitting

Example: Energy Consumption

Train-Test Split

Cross-Validation

Regularization

Model Bias and Variance

Summary

Chapter 17. Theory for Inference and Prediction

Distributions: Population, Empirical, Sampling

Basics of Hypothesis Testing

Example: A Rank Test to Compare Productivity of Wikipedia Contributors

Example: A Test of Proportions for Vaccine Efficacy

Bootstrapping for Inference

Basics of Confidence Intervals

Basics of Prediction Intervals

Example: Predicting Bus Lateness

Example: Predicting Crab Size

Example: Predicting the Incremental Growth of a Crab

Probability for Inference and Prediction

Formalizing the Theory for Average Rank Statistics

General Properties of Random Variables

Probability Behind Testing and Intervals

Probability Behind Model Selection

Summary

Chapter 18. Case Study: How to Weigh a Donkey

Donkey Study Question and Scope

Wrangling and Transforming

Exploring

Modeling a Donkey’s Weight

A Loss Function for Prescribing Anesthetics

Fitting a Simple Linear Model

Fitting a Multiple Linear Model

Bringing Qualitative Features into the Model

Model Assessment

Summary

Part VI. Classification

Chapter 19. Classification

Example: Wind-Damaged Trees

Modeling and Classification

A Constant Model

Examining the Relationship Between Size and Windthrow

Modeling Proportions (and Probabilities)

A Logistic Model

Log Odds

Using a Logistic Curve

A Loss Function for the Logistic Model

From Probabilities to Classification

The Confusion Matrix

Precision Versus Recall

Summary

Chapter 20. Numerical Optimization

Gradient Descent Basics

Minimizing Huber Loss

Convex and Differentiable Loss Functions

Variants of Gradient Descent

Stochastic Gradient Descent

Mini-Batch Gradient Descent

Newton’s Method

Summary

Chapter 21. Case Study: Detecting Fake News

Question and Scope

Obtaining and Wrangling the Data

Exploring the Data

Exploring the Publishers

Exploring Publication Date

Exploring Words in Articles

Modeling

A Single-Word Model

Multiple-Word Model

Predicting with the tf-idf Transform

Summary

Bibliography

Data Sources

Index

About the Authors

Colophon

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions—whether it's for companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the data science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data.

Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It's aimed at those who wish to become data scientists or who already work with data scientists, and at data analysts who wish to cross the "technical/nontechnical" divide. If you have a basic knowledge of Python programming, you'll learn how to work with data using industry-standard tools like pandas.

Refine a question of interest to one that can be studied with data
Pursue data collection that may involve text processing, web scraping, etc.
Glean valuable insights about data through data cleaning, exploration, and visualization
Learn how to use modeling to describe the data
Generalize findings beyond the data

Если вам понравилась эта страница - поделитесь ею с друзьями, тем самым вы помогаете нам развиваться и добавлять всё больше интересных и нужным вам книг