Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter, 3rd Edition

Author: Wes McKinney
Release date: 2022
Publisher: O’Reilly Media, Inc.
Pages: 582
File size: 2.6 MB
File type: PDF
Added by: codelibs

Cover....1

Copyright....4

Table of Contents....5

Preface....13

Section 1. Conventions Used in This Book....13

Section 2. Using Code Examples....14

Section 3. O’Reilly Online Learning....15

Section 4. How to Contact Us....15

Section 5. Acknowledgments....16

In Memoriam: John D. Hunter (1968–2012)....16

Acknowledgments for the Third Edition (2022)....16

Acknowledgments for the Second Edition (2017)....17

Acknowledgments for the First Edition (2012)....18

Chapter 1. Preliminaries....19

1.1 What Is This Book About?....19

What Kinds of Data?....19

1.2 Why Python for Data Analysis?....20

Python as Glue....21

Solving the “Two-Language” Problem....21

Why Not Python?....21

1.3 Essential Python Libraries....22

NumPy....22

pandas....23

matplotlib....24

IPython and Jupyter....24

SciPy....25

scikit-learn....26

statsmodels....26

Other Packages....27

1.4 Installation and Setup....27

Miniconda on Windows....27

GNU/Linux....28

Miniconda on macOS....29

Installing Necessary Packages....29

Integrated Development Environments and Text Editors....30

1.5 Community and Conferences....31

1.6 Navigating This Book....32

Code Examples....33

Data for Examples....33

Import Conventions....34

Chapter 2. Python Language Basics, IPython, and Jupyter Notebooks....35

2.1 The Python Interpreter....36

2.2 IPython Basics....37

Running the IPython Shell....37

Running the Jupyter Notebook....38

Tab Completion....41

Introspection....43

2.3 Python Language Basics....44

Language Semantics....44

Scalar Types....52

Control Flow....60

2.4 Conclusion....63

Chapter 3. Built-In Data Structures, Functions, and Files....65

3.1 Data Structures and Sequences....65

Tuple....65

List....69

Dictionary....73

Set....77

Built-In Sequence Functions....80

List, Set, and Dictionary Comprehensions....81

3.2 Functions....83

Namespaces, Scope, and Local Functions....85

Returning Multiple Values....86

Functions Are Objects....87

Anonymous (Lambda) Functions....88

Generators....89

Errors and Exception Handling....92

3.3 Files and the Operating System....94

Bytes and Unicode with Files....98

3.4 Conclusion....100

Chapter 4. NumPy Basics: Arrays and Vectorized Computation....101

4.1 The NumPy ndarray: A Multidimensional Array Object....103

Creating ndarrays....104

Data Types for ndarrays....106

Arithmetic with NumPy Arrays....109

Basic Indexing and Slicing....110

Boolean Indexing....115

Fancy Indexing....118

Transposing Arrays and Swapping Axes....120

4.2 Pseudorandom Number Generation....121

4.3 Universal Functions: Fast Element-Wise Array Functions....123

4.4 Array-Oriented Programming with Arrays....126

Expressing Conditional Logic as Array Operations....128

Mathematical and Statistical Methods....129

Methods for Boolean Arrays....131

Sorting....132

Unique and Other Set Logic....133

4.5 File Input and Output with Arrays....134

4.6 Linear Algebra....134

4.7 Example: Random Walks....136

Simulating Many Random Walks at Once....138

4.8 Conclusion....139

Chapter 5. Getting Started with pandas....141

5.1 Introduction to pandas Data Structures....142

Series....142

DataFrame....147

Index Objects....154

5.2 Essential Functionality....156

Reindexing....156

Dropping Entries from an Axis....159

Indexing, Selection, and Filtering....160

Arithmetic and Data Alignment....170

Function Application and Mapping....176

Sorting and Ranking....178

Axis Indexes with Duplicate Labels....182

5.3 Summarizing and Computing Descriptive Statistics....183

Correlation and Covariance....186

Unique Values, Value Counts, and Membership....188

5.4 Conclusion....191

Chapter 6. Data Loading, Storage, and File Formats....193

6.1 Reading and Writing Data in Text Format....193

Reading Text Files in Pieces....200

Writing Data to Text Format....202

Working with Other Delimited Formats....203

JSON Data....205

XML and HTML: Web Scraping....207

6.2 Binary Data Formats....211

Reading Microsoft Excel Files....212

Using HDF5 Format....213

6.3 Interacting with Web APIs....215

6.4 Interacting with Databases....217

6.5 Conclusion....219

Chapter 7. Data Cleaning and Preparation....221

7.1 Handling Missing Data....221

Filtering Out Missing Data....223

Filling In Missing Data....225

7.2 Data Transformation....227

Removing Duplicates....227

Transforming Data Using a Function or Mapping....229

Replacing Values....230

Renaming Axis Indexes....232

Discretization and Binning....233

Detecting and Filtering Outliers....235

Permutation and Random Sampling....237

Computing Indicator/Dummy Variables....239

7.3 Extension Data Types....242

7.4 String Manipulation....245

Python Built-In String Object Methods....245

Regular Expressions....247

String Functions in pandas....250

7.5 Categorical Data....253

Background and Motivation....254

Categorical Extension Type in pandas....255

Computations with Categoricals....258

Categorical Methods....260

7.6 Conclusion....263

Chapter 8. Data Wrangling: Join, Combine, and Reshape....265

8.1 Hierarchical Indexing....265

Reordering and Sorting Levels....268

Summary Statistics by Level....269

Indexing with a DataFrame’s columns....270

8.2 Combining and Merging Datasets....271

Database-Style DataFrame Joins....272

Merging on Index....277

Concatenating Along an Axis....281

Combining Data with Overlap....286

8.3 Reshaping and Pivoting....288

Reshaping with Hierarchical Indexing....288

Pivoting “Long” to “Wide” Format....291

Pivoting “Wide” to “Long” Format....295

8.4 Conclusion....297

Chapter 9. Plotting and Visualization....299

9.1 A Brief matplotlib API Primer....300

Figures and Subplots....301

Colors, Markers, and Line Styles....306

Ticks, Labels, and Legends....308

Annotations and Drawing on a Subplot....312

Saving Plots to File....314

matplotlib Configuration....315

9.2 Plotting with pandas and seaborn....316

Line Plots....316

Bar Plots....319

Histograms and Density Plots....327

Scatter or Point Plots....329

Facet Grids and Categorical Data....332

9.3 Other Python Visualization Tools....335

9.4 Conclusion....335

Chapter 10. Data Aggregation and Group Operations....337

10.1 How to Think About Group Operations....338

Iterating over Groups....342

Selecting a Column or Subset of Columns....344

Grouping with Dictionaries and Series....345

Grouping with Functions....346

Grouping by Index Levels....346

10.2 Data Aggregation....347

Column-Wise and Multiple Function Application....349

Returning Aggregated Data Without Row Indexes....353

10.3 Apply: General split-apply-combine....353

Suppressing the Group Keys....356

Quantile and Bucket Analysis....356

Example: Filling Missing Values with Group-Specific Values....358

Example: Random Sampling and Permutation....361

Example: Group Weighted Average and Correlation....362

Example: Group-Wise Linear Regression....365

10.4 Group Transforms and “Unwrapped” GroupBys....365

10.5 Pivot Tables and Cross-Tabulation....369

Cross-Tabulations: Crosstab....372

10.6 Conclusion....373

Chapter 11. Time Series....375

11.1 Date and Time Data Types and Tools....376

Converting Between String and Datetime....377

11.2 Time Series Basics....379

Indexing, Selection, Subsetting....381

Time Series with Duplicate Indices....383

11.3 Date Ranges, Frequencies, and Shifting....384

Generating Date Ranges....385

Frequencies and Date Offsets....388

Shifting (Leading and Lagging) Data....389

11.4 Time Zone Handling....392

Time Zone Localization and Conversion....393

Operations with Time Zone-Aware Timestamp Objects....395

Operations Between Different Time Zones....396

11.5 Periods and Period Arithmetic....397

Period Frequency Conversion....398

Quarterly Period Frequencies....400

Converting Timestamps to Periods (and Back)....402

Creating a PeriodIndex from Arrays....403

11.6 Resampling and Frequency Conversion....405

Downsampling....406

Upsampling and Interpolation....409

Resampling with Periods....410

Grouped Time Resampling....412

11.7 Moving Window Functions....414

Exponentially Weighted Functions....417

Binary Moving Window Functions....419

User-Defined Moving Window Functions....420

11.8 Conclusion....421

Chapter 12. Introduction to Modeling Libraries in Python....423

12.1 Interfacing Between pandas and Model Code....423

12.2 Creating Model Descriptions with Patsy....426

Data Transformations in Patsy Formulas....428

Categorical Data and Patsy....430

12.3 Introduction to statsmodels....433

Estimating Linear Models....433

Estimating Time Series Processes....437

12.4 Introduction to scikit-learn....438

12.5 Conclusion....441

Chapter 13. Data Analysis Examples....443

13.1 Bitly Data from 1.USA.gov....443

Counting Time Zones in Pure Python....444

Counting Time Zones with pandas....446

13.2 MovieLens 1M Dataset....453

Measuring Rating Disagreement....457

13.3 US Baby Names 1880–2010....461

Analyzing Naming Trends....466

13.4 USDA Food Database....475

13.5 2012 Federal Election Commission Database....481

Donation Statistics by Occupation and Employer....484

Bucketing Donation Amounts....487

Donation Statistics by State....489

13.6 Conclusion....490

Appendix A. Advanced NumPy....491

A.1 ndarray Object Internals....491

NumPy Data Type Hierarchy....492

A.2 Advanced Array Manipulation....494

Reshaping Arrays....494

C Versus FORTRAN Order....496

Concatenating and Splitting Arrays....497

Repeating Elements: tile and repeat....499

Fancy Indexing Equivalents: take and put....501

A.3 Broadcasting....502

Broadcasting over Other Axes....505

Setting Array Values by Broadcasting....507

A.4 Advanced ufunc Usage....508

ufunc Instance Methods....508

Writing New ufuncs in Python....511

A.5 Structured and Record Arrays....511

Nested Data Types and Multidimensional Fields....512

Why Use Structured Arrays?....513

A.6 More About Sorting....513

Indirect Sorts: argsort and lexsort....515

Alternative Sort Algorithms....516

Partially Sorting Arrays....517

numpy.searchsorted: Finding Elements in a Sorted Array....518

A.7 Writing Fast NumPy Functions with Numba....519

Creating Custom numpy.ufunc Objects with Numba....520

A.8 Advanced Array Input and Output....521

Memory-Mapped Files....521

HDF5 and Other Array Storage Options....522

A.9 Performance Tips....523

The Importance of Contiguous Memory....523

Appendix B. More on the IPython System....527

B.1 Terminal Keyboard Shortcuts....527

B.2 About Magic Commands....528

The %run Command....530

Executing Code from the Clipboard....531

B.3 Using the Command History....532

Searching and Reusing the Command History....532

Input and Output Variables....533

B.4 Interacting with the Operating System....534

Shell Commands and Aliases....535

Directory Bookmark System....536

B.5 Software Development Tools....537

Interactive Debugger....537

Timing Code: %time and %timeit....541

Basic Profiling: %prun and %run -p....543

Profiling a Function Line by Line....545

B.6 Tips for Productive Code Development Using IPython....547

Reloading Module Dependencies....547

Code Design Tips....548

B.7 Advanced IPython Features....550

Profiles and Configuration....550

B.8 Conclusion....551

Index....553

About the Author....580

Colophon....581

Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You'll learn the latest versions of pandas, NumPy, and Jupyter in the process.

Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It's ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

  • Use the Jupyter notebook and IPython shell for exploratory computing
  • Learn basic and advanced features in NumPy
  • Get started with data analysis tools in the pandas library
  • Use flexible tools to load, clean, transform, merge, and reshape data
  • Create informative visualizations with matplotlib
  • Apply the pandas groupby facility to slice, dice, and summarize datasets
  • Analyze and manipulate regular and irregular time series data
  • Learn how to solve real-world data analysis problems with thorough, detailed examples
