Data Science Fundamentals with R, Python, and Open Data

Author: Cremonini Marco
Release date: 2024
Publisher: John Wiley & Sons, Inc.
Number of pages: 480
File size: 2.3 MB
File type: PDF
Added by: codelibs

Cover....1

Title Page....5

Copyright....6

Contents....7

Preface....15

About the Companion Website....19

Introduction....21

Chapter 1 Open‐Source Tools for Data Science....25

1.1 R Language and RStudio....25

1.1.1 R Language....26

1.1.2 RStudio Desktop....26

1.1.3 Package Manager....26

1.1.4 Package Tidyverse....28

1.2 Python Language and Tools....29

1.2.1 Option A: Anaconda Distribution....30

1.2.2 Option B: Manual Installation....30

1.2.3 Google Colab....31

1.2.4 Packages NumPy and Pandas....31

1.3 Advanced Plain Text Editor....32

1.4 CSV Format for Datasets....32

Questions....34

Chapter 2 Simple Exploratory Data Analysis....37

2.1 Missing Values Analysis....37

2.2 R: Descriptive Statistics and Utility Functions....39

2.3 Python: Descriptive Statistics and Utility Functions....41

Questions....43

Chapter 3 Data Organization and First Data Frame Operations....47

3.1 R: Read CSV Datasets and Column Selection....48

3.1.1 Reading a CSV Dataset....50

3.1.1.1 Reading Errors....51

3.1.2 Selection by Column Name....53

3.1.3 Selection by Column Index Position....54

3.1.4 Selection by Range....55

3.1.5 Selection by Exclusion....56

3.1.6 Selection with Selection Helper....59

3.2 R: Rename and Relocate Columns....60

3.3 R: Slicing, Column Creation, and Deletion....62

3.3.1 Subsetting and Slicing....63

3.3.2 Column Creation....66

3.3.3 Column Deletion....67

3.3.4 Calculated Columns....68

3.3.5 Function mutate() and Data Masking....68

3.4 R: Separate and Unite Columns....69

3.4.1 Separation....70

3.4.2 Union....72

3.5 R: Sorting Data Frames....73

3.5.1 Sorting by Multiple Columns....74

3.5.2 Sorting by an External List....75

3.6 R: Pipe....79

3.6.1 Forward Pipe....79

3.6.2 Pipe in Base R....81

3.6.2.1 Variant....81

3.6.3 Parameter Placeholder....82

3.7 Python: Column Selection....83

3.7.1 Selecting Columns from Dataset Read....85

3.7.2 Selecting Columns from a Data Frame....86

3.7.3 Selection by Positional Index, Range, or with Selection Helper....87

3.7.4 Selection by Exclusion....88

3.8 Python: Rename and Relocate Columns....91

3.8.1 Standard Method....91

3.8.2 Functions rename() and reindex()....91

3.9 Python: NumPy Slicing, Selection with Index, Column Creation and Deletion....93

3.9.1 NumPy Array Slicing....93

3.9.2 Slicing of Pandas Data Frames....94

3.9.3 Methods .loc and .iloc....97

3.9.4 Selection with Selection Helper....101

3.9.5 Creating and Deleting Columns....103

3.9.6 Functions insert() and assign()....104

3.10 Python: Separate and Unite Columns....105

3.10.1 Separate....105

3.10.2 Unite....108

3.11 Python: Sorting Data Frame....109

3.11.1 Sorting Columns....109

3.11.2 Sorting Index Levels....110

3.11.3 From Indexed to Non‐indexed Data Frame....112

3.11.4 Sorting by an External List....113

Questions....115

Chapter 4 Subsetting with Logical Conditions....123

4.1 Logical Operators....123

4.2 R: Row Selection....125

4.2.1 Operator %in%....128

4.2.2 Boolean Mask....129

4.2.3 Examples....130

4.2.3.1 Wrong Disjoint Condition....131

4.2.4 Python: Row Selection....138

4.2.5 Boolean Mask, Base Selection Method....139

4.2.6 Row Selection with query()....143

Questions....145

Chapter 5 Operations on Dates, Strings, and Missing Values....151

5.1 R: Operations on Dates and Strings....153

5.1.1 Date and Time....153

5.1.1.1 Datetime Data Type....153

5.1.2 Parsing Dates....154

5.1.3 Using Dates....156

5.1.4 Selection with Logical Conditions on Dates....157

5.1.5 Strings....160

5.2 R: Handling Missing Values and Data Type Transformations....165

5.2.1 Missing Values as Replacement....166

5.2.1.1 Keywords for Missing Values....166

5.2.2 Introducing Missing Values in Dataset Reads....167

5.2.3 Verifying the Presence of Missing Values....168

5.2.3.1 Functions any(), all(), and colSums()....170

5.2.4 Replacing Missing Values....171

5.2.5 Omit Rows with Missing Values....173

5.2.6 Data Type Transformations....174

5.3 R: Example with Dates, Strings, and Missing Values....178

5.3.1 When an Invisible Hand Messes with Your Data....182

5.3.2 Base Method....183

5.3.3 A Better Heuristic....186

5.3.4 Specialized Functions....186

5.3.4.1 Function parse_date_time()....186

5.3.5 Result Comparison....189

5.4 Python: Operations on Dates and Strings....189

5.4.1 Date and Time....189

5.4.1.1 Function pd.to_datetime()....189

5.4.1.2 Function datetime.datetime.strptime()....191

5.4.1.3 Locale Configuration....192

5.4.1.4 Function datetime.datetime.strftime()....193

5.4.1.5 Pandas Timestamp Functions....193

5.4.2 Selection with Logical Conditions on Dates....195

5.4.3 Strings....196

5.5 Python: Handling Missing Values and Data Type Transformations....197

5.5.1 Missing Values as Replacement....197

5.5.1.1 Function pd.replace()....199

5.5.2 Introducing Missing Values in Dataset Reads....199

5.5.3 Verifying the Presence of Missing Values....200

5.5.4 Selection with Missing Values....202

5.5.5 Replacing Missing Values with Actual Values....203

5.5.6 Modifying Values by View or by Copy....204

5.5.7 Data Type Transformations....206

5.6 Python: Examples with Dates, Strings, and Missing Values....206

5.6.1 Example 1: Eurostat....206

5.6.2 Example 2: Open Data Berlin....210

Questions....214

Chapter 6 Pivoting and Wide‐long Transformations....219

6.1 R: Pivoting....221

6.1.1 From Long to Wide....221

6.1.2 From Wide to Long....223

6.1.3 GOV.UK: Gender Pay Gap....224

6.2 Python: Pivoting....226

6.2.1 From Wide to Long with Columns....227

6.2.2 From Long to Wide with Columns....228

6.2.3 Wide‐long Transformation with Index Levels....230

6.2.4 Indexed Data Frame....231

6.2.4.1 Function unstack()....232

6.2.4.2 Function stack()....235

6.2.5 From Long to Wide with Elements of Numeric Type....237

Questions....240

Chapter 7 Groups and Operations on Groups....245

7.1 R: Groups....246

7.1.1 Groups and Group Indexes....248

7.1.1.1 Function group_by()....248

7.1.1.2 Index Details....250

7.1.2 Aggregation Operations....251

7.1.2.1 Functions group_by() and summarize()....251

7.1.2.2 Counting Rows: function n()....252

7.1.2.3 Arithmetic Mean: function mean()....252

7.1.2.4 Maximum and Minimum Values: Functions max() and min()....254

7.1.2.5 Summing Values: function sum()....255

7.1.2.6 List of Aggregation Functions....256

7.1.3 Sorting Within Groups....256

7.1.4 Creation of Columns in Grouped Data Frames....258

7.1.5 Slicing Rows on Groups....260

7.1.5.1 Functions slice_*()....260

7.1.5.2 Combination of Functions filter() and rank()....262

7.1.6 Calculated Columns with Group Values....266

7.2 Python: Groups....268

7.2.1 Group Index and Aggregation Operations....271

7.2.1.1 Functions groupby() and aggregate()....271

7.2.1.2 Counting Rows, Computing Arithmetic Means, and Sum for Each Group....271

7.2.2 Names on Columns with Aggregated Values....275

7.2.3 Sorting Columns....276

7.2.4 Sorting on Index Levels....278

7.2.5 Slicing Rows on Groups....279

7.2.5.1 Functions nlargest() and nsmallest()....283

7.2.6 Calculated Columns with Group Values....283

7.2.7 Sorting Within Groups....285

Questions....289

Chapter 8 Conditions and Iterations....295

8.1 R: Conditions and Iterations....296

8.1.1 Conditions....296

8.1.1.1 Function if_else()....299

8.1.1.2 Function case_when()....300

8.1.1.3 Function if() and Constructs If‐else and If‐else If‐else....301

8.1.2 Iterations....302

8.1.2.1 Function for()....302

8.1.2.2 Function foreach()....304

8.1.3 Nested Iterations....304

8.1.3.1 Replacing a Single‐Element Value....306

8.1.3.2 Iterate on the First Column....307

8.1.3.3 Iterate on all Columns....307

8.2 Python: Conditions and Iterations....308

8.2.1 Conditions....308

8.2.1.1 Function if()....309

8.2.1.2 Constructs If‐else and If‐elif‐else....309

8.2.1.3 Function np.where()....310

8.2.1.4 Function np.select()....311

8.2.1.5 Functions pd.where() and pd.mask()....313

8.2.2 Iterations....315

8.2.2.1 Functions for() and while()....315

8.2.3 Nested Iterations....318

8.2.3.1 Execution Time....320

8.2.4 Iterating on Multi‐index....321

8.2.4.1 Function join()....324

8.2.4.2 Function items()....325

Questions....326

Chapter 9 Functions and Multicolumn Operations....331

9.1 R: User‐defined Functions....332

9.1.1 Using Functions....333

9.1.2 Data Masking....336

9.1.3 Anonymous Functions....339

9.2 R: Multicolumn Operations....340

9.2.1 Base Method....340

9.2.1.1 Functions apply(), lapply(), and sapply()....340

9.2.1.2 Mapping....343

9.2.2 Mapping and Anonymous Functions: purrr‐style Syntax....345

9.2.3 Conditional Mapping....345

9.2.4 Subsetting Rows with Multicolumn Logical Condition....347

9.2.4.1 Combination of Functions filter() and if_any()....347

9.2.5 Multicolumn Transformations....348

9.2.5.1 Combination of Functions mutate() and across()....348

9.2.6 Introducing Missing Values....349

9.2.7 Use Cases and Execution Time Measurement....350

9.2.7.1 Case 1....351

9.2.7.2 Case 2....352

9.3 Python: User‐defined and Lambda Functions....354

9.3.1 User‐defined Functions....354

9.3.1.1 Lambda Functions....357

9.3.2 Python: Multicolumn Operations....358

9.3.2.1 Execution Time....360

9.3.3 General Case....361

9.3.3.1 Function apply()....361

Questions....366

Chapter 10 Join Data Frames....371

10.1 Basic Concepts....372

10.1.1 Keys of a Join Operation....373

10.1.2 Types of Join....374

10.1.3 R: Join Operation....375

10.1.4 Join Functions....378

10.1.4.1 Function inner_join()....378

10.1.4.2 Function full_join()....380

10.1.4.3 Functions left_join() and right_join()....381

10.1.4.4 Function merge()....381

10.1.5 Duplicated Keys....382

10.1.6 Special Join Functions....387

10.1.6.1 Semi Join....387

10.1.6.2 Anti Join....389

10.2 Python: Join Operations....393

10.2.1.1 Function merge()....395

10.2.1.2 Inner Join....396

10.2.1.3 Outer/Full Join....398

10.2.2 Join Operations with Indexed Data Frames....399

10.2.3 Duplicated Keys....402

10.2.4 Special Join Types....408

10.2.4.1 Semi Join: Function isin()....408

10.2.4.2 Anti Join: Variants....410

Questions....413

Chapter 11 List/Dictionary Data Format....417

11.1 R: List Data Format....419

11.1.1 Transformation of List Columns to Ordinary Rows and Columns....425

11.1.1.1 Other Options....427

11.1.2 Function map in List Column Transformations....430

11.2 R: JSON Data Format and Use Cases....434

11.2.1 Memory Problem when Reading Very Large Datasets....445

11.3 Python: Dictionary Data Format....446

11.3.1 Methods....448

11.3.2 From Dictionary to Data Frame With a Single Level of Nesting....451

11.3.2.1 Functions pd.DataFrame() and pd.DataFrame.from_dict()....451

11.3.3 From Dictionary to Data Frame with Several Levels of Nesting....453

11.3.3.1 Function pd.json_normalize() and Join Operation....453

11.3.4 Python: Use Cases with JSON Datasets....460

Questions....467

Index....471

EULA....480

Organized with a strong focus on open data, Data Science Fundamentals with R, Python, and Open Data discusses the concepts, techniques, tools, and first steps needed to carry out data science projects using Python and RStudio, reflecting a clear industry trend towards integrating the two. The text examines the intricacies and inconsistencies often found in real data, explains how to recognize them, and guides readers through possible solutions, so that they can handle real data confidently and apply transformations to reorganize, index, aggregate, and elaborate it.

The book is highly interactive, with a companion website hosting supplementary material, including the datasets used in the examples and the complete running code (R scripts and Jupyter notebooks) for all examples. Exam-style and multiple-choice questions support readers' active learning. Each chapter presents one or more case studies.

Written by a highly qualified academic, Data Science Fundamentals with R, Python, and Open Data discusses sample topics such as:

  • Data organization and operations on data frames, covering reading CSV datasets and common errors, and slicing, creating, and deleting columns in R
  • Logical conditions and row selection, covering selection of rows with logical conditions and operations on dates, strings, and missing values
  • Pivoting operations and wide form-long form transformations, indexing by groups with multiple variables, and indexing by group and aggregations
  • Conditional statements and iterations, multicolumn functions and operations, data frame joins, and handling data in list/dictionary format

Data Science Fundamentals with R, Python, and Open Data is a highly accessible learning resource for students from heterogeneous disciplines where Data Science and quantitative, computational methods are gaining popularity, along with hard sciences not closely related to computer science, and medical fields using stochastic and quantitative models.

