Cover....2
Title Page....6
Copyright Page....7
Dedication....8
Contents....9
Preface....19
Chapter 1: Introduction to Python....25
Tools for Python....25
easy_install and pip....26
virtualenv....26
IPython....26
Python Installation....27
Setting the PATH Environment Variable (Windows Only)....28
Launching Python on Your Machine....28
The Python Interactive Interpreter....29
Python Identifiers....30
Lines, Indentation, and Multi-Lines....30
Quotation and Comments....31
Saving Your Code in a Module....33
Some Standard Modules....34
The help() and dir() Functions....35
Compile Time and Runtime Code Checking....36
Simple Data Types....37
Working with Numbers....37
Working with Other Bases....39
The chr() Function....39
The round() Function in Python....40
Formatting Numbers in Python....41
Working with Fractions....41
Unicode and UTF-8....42
Working with Unicode....42
Working with Strings....43
Comparing Strings....45
Formatting Strings in Python....46
Uninitialized Variables and the Value None....46
Slicing and Splicing Strings....46
Testing for Digits and Alphabetic Characters....47
Search and Replace a String in Other Strings....48
Remove Leading and Trailing Characters....49
Printing Text Without NewLine Characters....50
Text Alignment....51
Working with Dates....52
Converting Strings to Dates....54
Exception Handling....54
Handling User Input....56
Command-Line Arguments....58
Summary....60
Chapter 2: Working with Data....61
Dealing with Data: What Can Go Wrong?....62
What is Data Drift?....62
What are Datasets?....63
Data Preprocessing....64
Data Types....65
Preparing Datasets....66
Discrete Data vs. Continuous Data....67
“Binning” Continuous Data....68
Scaling Numeric Data via Normalization....68
Scaling Numeric Data via Standardization....70
Scaling Numeric Data via Robust Standardization....71
What to Look for in Categorical Data....71
Mapping Categorical Data to Numeric Values....72
Working with Dates....74
Working with Currency....75
Working with Outliers and Anomalies....75
Outlier Detection/Removal....76
Finding Outliers with NumPy....78
Finding Outliers with Pandas....81
Calculating Z-Scores to Find Outliers....84
Finding Outliers with SkLearn (Optional)....85
Working with Missing Data....87
Imputing Values: When is Zero a Valid Value?....89
Dealing with Imbalanced Datasets....90
What is SMOTE?....91
SMOTE Extensions....92
The Bias-Variance Tradeoff....92
Types of Bias in Data....94
Analyzing Classifiers (Optional)....95
What is LIME?....95
What is ANOVA?....96
Summary....97
Chapter 3: Introduction to Pandas....98
What is Pandas?....98
Pandas Data Frames....99
Data Frames and Data Cleaning Tasks....99
A Pandas Data Frame Example....100
Describing a Pandas Data Frame....102
Pandas Boolean Data Frames....104
Transposing a Pandas Data Frame....105
Pandas Data Frames and Random Numbers....106
Converting Categorical Data to Numeric Data....108
Merging and Splitting Columns in Pandas....112
Combining Pandas Data Frames....114
Data Manipulation with Pandas Data Frames....115
Pandas Data Frames and CSV Files....116
Useful Options for the Pandas read_csv() Function....119
Reading Selected Rows from CSV Files....120
Pandas Data Frames and Excel Spreadsheets....123
Useful Options for Reading Excel Spreadsheets....124
Select, Add, and Delete Columns in Data Frames....125
Handling Outliers in Pandas....127
Pandas Data Frames and Simple Statistics....129
Finding Duplicate Rows in Pandas....130
Finding Missing Values in Pandas....133
Missing Values in an Iris-Based Dataset....135
Sorting Data Frames in Pandas....139
Working with groupby() in Pandas....141
Aggregate Operations with the titanic.csv Dataset....142
Working with apply() and mapapply() in Pandas....145
Useful One-line Commands in Pandas....148
Working with JSON-based Data....150
Python Dictionary and JSON....151
Python, Pandas, and JSON....152
Summary....153
Chapter 4: RDBMS and SQL....155
What is an RDBMS?....155
What Relationships Do Tables Have in an RDBMS?....155
Features of an RDBMS....156
What is ACID?....157
When Do We Need an RDBMS?....157
The Importance of Normalization....159
A Four-Table RDBMS....161
Detailed Table Descriptions....162
The customers Table....163
The purchase_orders Table....164
The line_items Table....165
The item_desc Table....167
What is SQL?....168
DCL, DDL, DQL, DML, and TCL....169
SQL Privileges....170
Properties of SQL Statements....170
The CREATE Keyword....171
What is MySQL?....171
What about MariaDB?....172
Installing MySQL....172
Data Types in MySQL....173
The CHAR and VARCHAR Data Types....173
String-based Data Types....174
FLOAT and DOUBLE Data Types....174
BLOB and TEXT Data Types....175
MySQL Database Operations....175
Creating a Database....175
Display a List of Databases....176
Display a List of Database Users....176
Dropping a Database....177
Exporting a Database....177
Renaming a Database....179
The INFORMATION_SCHEMA Table....180
The PROCESSLIST Table....181
SQL Formatting Tools....181
Summary....182
Chapter 5: Java, JSON, and XML....184
Working with Java and MySQL....185
Performing the Set-up Steps....185
Creating a MySQL Database in Java....186
Creating a MySQL Table in Java....188
Inserting Data into a MySQL Table in Java....190
Deleting Data and Dropping MySQL Tables in Java....192
Selecting Data from a MySQL Table in Java....194
Updating Data in a MySQL Table in Java....196
Working with JSON, MySQL, and Java....198
Select JSON-based Data from a MySQL Table in Java....199
Working with XML, MySQL, and Java....201
What is XML?....201
What is an XML Schema?....202
When are XML Schemas Useful?....203
Create a MySQL Table for XML Data in Java....204
Read an XML Document in Java....207
Read an XML Document as a String in Java....208
Insert XML-based Data into a MySQL Table in Java....210
Select XML-based Data from a MySQL Table in Java....213
Parse XML-based String Data from a MySQL Table in Java....215
Working with XML Schemas....218
Summary....219
Chapter 6: Data Cleaning Tasks....221
What is Data Cleaning?....222
Data Cleaning for Personal Titles....224
Data Cleaning in SQL....225
Replace NULL with 0....226
Replace NULL Values with Average Value....226
Replace Multiple Values with a Single Value....228
Handle Mismatched Attribute Values....229
Convert Strings to Date Values....231
Data Cleaning from the Command Line (Optional)....233
Working with the sed Utility....233
Working with Variable Column Counts....235
Truncating Rows in CSV Files....237
Generating Rows with Fixed Columns with the awk Utility....239
Converting Phone Numbers....241
Converting Numeric Date Formats....244
Converting Alphabetic Date Formats....248
Working with Date and Time Date Formats....251
Working with Codes, Countries, and Cities....257
Data Cleaning on a Kaggle Dataset....264
Summary....268
Chapter 7: Data Wrangling....269
What is Data Wrangling?....270
Data Transformation: What Does This Mean?....270
CSV Files with Multi-Row Records....273
Pandas Solution (1)....273
Pandas Solution (2)....274
CSV Solution....275
CSV Files, Multi-row Records, and the awk Command....276
Quoted Fields Split on Two Lines (Optional)....277
Overview of the Events Project....281
Why This Project?....282
Project Tasks....283
Generate Country Codes....284
Prepare a List of Cities in Countries....284
Generating City Codes from Country Codes: awk....285
Generating City Codes from Country Codes: Python....289
Generating SQL Statements for the city_codes Table....292
Generating a CSV File for Band Members (Java)....293
Generating a CSV File for Band Members (Python)....297
Generating a Calendar of Events (COE)....300
Project Automation Script....304
Project Follow-up Comments....306
Summary....308
Appendix A: Working with awk....309
The awk Command....310
Built-in Variables That Control awk....310
How Does the awk Command Work?....311
Aligning Text with the printf() Statement....312
Conditional Logic and Control Statements....314
The while Statement....314
A for Loop in awk....315
A for Loop with a break Statement....316
The next and continue Statements....316
Deleting Alternate Lines in Datasets....317
Merging Lines in Datasets....317
Printing File Contents as a Single Line....318
Joining Groups of Lines in a Text File....319
Joining Alternate Lines in a Text File....320
Matching with Meta Characters and Character Sets....321
Printing Lines Using Conditional Logic....322
Splitting Filenames with awk....323
Working with Postfix Arithmetic Operators....324
Numeric Functions in awk....325
One-line awk Commands....328
Useful Short awk Scripts....329
Printing the Words in a Text String in awk....331
Count Occurrences of a String in Specific Rows....331
Printing a String in a Fixed Number of Columns....333
Printing a Dataset in a Fixed Number of Columns....333
Aligning Columns in Datasets....334
Aligning Columns and Multiple Rows in Datasets....335
Removing a Column from a Text File....337
Subsets of Column-aligned Rows in Datasets....338
Counting Word Frequency in Datasets....339
Displaying Only “Pure” Words in a Dataset....341
Working with Multi-line Records in awk....343
A Simple Use Case....344
Another Use Case....346
Summary....347
Index....349
This book is intended primarily for those who plan to become data scientists as wellas anyone who needs to perform data cleaning tasks. It contains a variety of features of NumPy and Pandas and how to create databases and tables in MySQL. Chapter 7 covers many data wrangling tasks using Python scripts and awk-based shell scripts. Companion files with code are available for downloading from the publisher.