Cover....1
Half-Title....2
Title....4
Copyright....5
Dedication....6
Contents....8
Preface....20
Chapter 1: Introduction to Python....26
Tools for Python....26
easy_install and pip....26
virtualenv....27
Python Installation....27
Setting the PATH Environment Variable (Windows Only)....28
Launching Python on Your Machine....28
The Python Interactive Interpreter....28
Python Identifiers....29
Lines, Indentations, and Multi-Lines....30
Quotation and Comments in Python....30
Saving Your Code in a Module....32
Some Standard Modules in Python....33
The help() and dir() Functions....33
Compile Time and Runtime Code Checking....34
Simple Data Types in Python....35
Working with Numbers....35
Working with Other Bases....37
The chr() Function....37
The round() Function in Python....38
Formatting Numbers in Python....38
Unicode and UTF-8....39
Working with Unicode....39
Listing 1.1: Unicode1.py....40
Working with Strings....40
Comparing Strings....41
Listing 1.2: Compare.py....42
Formatting Strings in Python....42
Uninitialized Variables and the Value None in Python....42
Slicing and Splicing Strings....43
Testing for Digits and Alphabetic Characters....43
Listing 1.3: CharTypes.py....43
Search and Replace a String in Other Strings....44
Listing 1.4: FindPos1.py....44
Listing 1.5: Replace1.py....45
Remove Leading and Trailing Characters....45
Listing 1.6: Remove1.py....45
Printing Text without Newline Characters....46
Text Alignment....47
Working with Dates....48
Listing 1.7: Datetime2.py....48
Listing 1.8: datetime2.out....48
Converting Strings to Dates....49
Listing 1.9: String2Date.py....49
Exception Handling in Python....49
Listing 1.10: Exception1.py....50
Handling User Input....51
Listing 1.11: UserInput1.py....51
Listing 1.12: UserInput2.py....52
Listing 1.13: UserInput3.py....52
Command-Line Arguments....53
Listing 1.14: Hello.py....54
Summary....54
Chapter 2: Introduction to NumPy....56
What is NumPy?....57
Useful NumPy Features....57
What are NumPy Arrays?....57
Listing 2.1: nparray1.py....58
Working with Loops....58
Listing 2.2: loop1.py....58
Appending Elements to Arrays (1)....59
Listing 2.3: append1.py....59
Appending Elements to Arrays (2)....60
Listing 2.4: append2.py....60
Multiplying Lists and Arrays....60
Listing 2.5: multiply1.py....61
Doubling the Elements in a List....61
Listing 2.6: double_list1.py....61
Lists and Exponents....62
Listing 2.7: exponent_list1.py....62
Arrays and Exponents....62
Listing 2.8: exponent_array1.py....62
Math Operations and Arrays....63
Listing 2.9: mathops_array1.py....63
Working with “−1” Sub-ranges with Vectors....63
Listing 2.10: npsubarray2.py....63
Working with “−1” Sub-ranges with Arrays....64
Listing 2.11: np2darray2.py....64
Other Useful NumPy Methods....64
Arrays and Vector Operations....65
Listing 2.12: array_vector.py....65
NumPy and Dot Products (1)....66
Listing 2.13: dotproduct1.py....66
NumPy and Dot Products (2)....67
Listing 2.14: dotproduct2.py....67
NumPy and the Length of Vectors....67
Listing 2.15: array_norm.py....68
NumPy and Other Operations....68
Listing 2.16: otherops.py....69
NumPy and the reshape() Method....69
Listing 2.17: numpy_reshape.py....69
Calculating the Mean and Standard Deviation....70
Listing 2.18: sample_mean_std.py....71
Code Sample with Mean and Standard Deviation....71
Listing 2.19: stat_values.py....72
Trimmed Mean and Weighted Mean....72
Working with Lines in the Plane (Optional)....73
Plotting Randomized Points with NumPy and Matplotlib....75
Listing 2.20: np_plot.py....76
Plotting a Quadratic with NumPy and Matplotlib....76
Listing 2.21: np_plot_quadratic.py....76
What is Linear Regression?....77
What is Multivariate Analysis?....78
What about Non-Linear Datasets?....78
The MSE (Mean Squared Error) Formula....79
Other Error Types....80
Non-Linear Least Squares....81
Calculating the MSE Manually....81
Find the Best-Fitting Line in NumPy....82
Listing 2.22: find_best_fit.py....83
Calculating MSE by Successive Approximation (1)....83
Listing 2.23: plain_linreg1.py....84
Calculating MSE by Successive Approximation (2)....86
Listing 2.24: plain_linreg2.py....86
Google Colaboratory....88
Uploading CSV Files in Google Colaboratory....90
Listing 2.25: upload_csv_file.ipynb....90
Summary....91
Chapter 3: Introduction to Pandas....92
What is Pandas?....92
Pandas Options and Settings....93
Pandas Data Frames....93
Data Frames and Data Cleaning Tasks....94
Alternatives to Pandas....94
A Pandas Data Frame with a NumPy Example....95
Listing 3.1: pandas_df.py....95
Describing a Pandas Data Frame....97
Listing 3.2: pandas_df_describe.py....97
Pandas Boolean Data Frames....99
Listing 3.3: pandas_boolean_df.py....99
Transposing a Pandas Data Frame....100
Pandas Data Frames and Random Numbers....101
Listing 3.4: pandas_random_df.py....101
Listing 3.5: pandas_combine_df.py....101
Reading CSV Files in Pandas....102
Listing 3.6: sometext.txt....102
Listing 3.7: read_csv_file.py....103
The loc() and iloc() Methods in Pandas....103
Converting Categorical Data to Numeric Data....104
Listing 3.8: cat2numeric.py....104
Listing 3.9: shirts.csv....105
Listing 3.10: shirts.py....105
Matching and Splitting Strings in Pandas....107
Listing 3.11: shirts_str.py....107
Converting Strings to Dates in Pandas....110
Listing 3.12: string2date.py....110
Merging and Splitting Columns in Pandas....111
Listing 3.13: employees.csv....111
Listing 3.14: emp_merge_split.py....111
Combining Pandas Data Frames....113
Listing 3.15: concat_frames.py....113
Data Manipulation with Pandas Data Frames (1)....113
Listing 3.16: pandas_quarterly_df1.py....114
Data Manipulation with Pandas Data Frames (2)....115
Listing 3.17: pandas_quarterly_df2.py....115
Data Manipulation with Pandas Data Frames (3)....116
Listing 3.18: pandas_quarterly_df3.py....116
Pandas Data Frames and CSV Files....117
Listing 3.19: weather_data.py....117
Listing 3.20: people.csv....118
Listing 3.21: people_pandas.py....118
Managing Columns in Data Frames....119
Switching Columns....120
Appending Columns....120
Deleting Columns....121
Inserting Columns....121
Scaling Numeric Columns....122
Listing 3.22: numbers.csv....122
Listing 3.23: scale_columns.py....123
Managing Rows in Pandas....124
Selecting a Range of Rows in Pandas....124
Listing 3.24: duplicates.csv....124
Listing 3.25: row_range.py....125
Finding Duplicate Rows in Pandas....126
Listing 3.26: duplicates.py....126
Listing 3.27: drop_duplicates.py....127
Inserting New Rows in Pandas....129
Listing 3.28: emp_ages.csv....129
Listing 3.29: insert_row.py....129
Handling Missing Data in Pandas....129
Listing 3.30: employees2.csv....130
Listing 3.31: missing_values.py....130
Multiple Types of Missing Values....132
Listing 3.32: employees3.csv....132
Listing 3.33: missing_multiple_types.py....132
Test for Numeric Values in a Column....132
Listing 3.34: test_for_numeric.py....133
Replacing NaN Values in Pandas....133
Listing 3.35: missing_fill_drop.py....133
Sorting Data Frames in Pandas....135
Listing 3.36: sort_df.py....135
Working with groupby() in Pandas....137
Listing 3.37: groupby1.py....137
Working with apply() and applymap() in Pandas....138
Listing 3.38: apply1.py....139
Listing 3.39: apply2.py....140
Listing 3.40: mapapply1.py....140
Listing 3.41: mapapply2.py....141
Handling Outliers in Pandas....142
Listing 3.42: outliers_zscores.py....142
Pandas Data Frames and Scatterplots....144
Listing 3.43: pandas_scatter_df.py....144
Pandas Data Frames and Simple Statistics....145
Listing 3.44: housing.csv....145
Listing 3.45: housing_stats.py....145
Aggregate Operations in Pandas Data Frames....146
Listing 3.46: aggregate1.py....147
Aggregate Operations with the titanic.csv Dataset....148
Listing 3.47: aggregate2.py....148
Save Data Frames as CSV Files and Zip Files....150
Listing 3.48: save2csv.py....150
Pandas Data Frames and Excel Spreadsheets....151
Listing 3.49: write_people_xlsx.py....151
Listing 3.50: read_people_xslx.py....151
Working with JSON-based Data....152
Python Dictionary and JSON....152
Listing 3.51: dict2json.py....152
Python, Pandas, and JSON....153
Listing 3.52: pd_python_json.py....153
Useful One-line Commands in Pandas....154
What is Method Chaining?....155
Pandas and Method Chaining....156
Pandas Profiling....156
Listing 3.53: titanic.csv....156
Listing 3.54: profile_titanic.py....157
Summary....157
Chapter 4: Working with Sklearn and SciPy....158
What is Sklearn?....158
Sklearn Features....159
The Digits Dataset in Sklearn....160
Listing 4.1: load_digits1.py....160
Listing 4.2: load_digits2.py....161
Listing 4.3: sklearn_digits.py....162
The train_test_split() Function in Sklearn....163
Selecting Columns for X and y....164
What is Feature Engineering?....164
The Iris Dataset in Sklearn (1)....165
Listing 4.4: sklearn_iris1.py....165
Sklearn, Pandas, and the Iris Dataset....167
Listing 4.5: pandas_iris.py....167
The Iris Dataset in Sklearn (2)....169
Listing 4.6: sklearn_iris2.py....169
The Faces Dataset in Sklearn (Optional)....171
Listing 4.7: sklearn_faces.py....171
What is SciPy?....173
Installing SciPy....173
Permutations and Combinations in SciPy....174
Listing 4.8: scipy_perms.py....174
Listing 4.9: scipy_combinatorics.py....174
Calculating Log Sums....175
Listing 4.10: scipy_matrix_inv.py....175
Calculating Polynomial Values....175
Listing 4.11: scipy_poly.py....175
Calculating the Determinant of a Square Matrix....176
Listing 4.12: scipy_determinant.py....176
Calculating the Inverse of a Matrix....177
Listing 4.13: scipy_matrix_inv.py....177
Calculating Eigenvalues and Eigenvectors....177
Listing 4.14: scipy_eigen.py....177
Calculating Integrals (Calculus)....178
Listing 4.15: scipy_integrate.py....178
Calculating Fourier Transforms....179
Listing 4.16: scipy_fourier.py....179
Flipping Images in SciPy....180
Listing 4.17: scipy_flip_image.py....180
Rotating Images in SciPy....181
Listing 4.18: scipy_rotate_image.py....181
Google Colaboratory....182
Uploading CSV Files in Google Colaboratory....183
Listing 4.19: upload_csv_file.ipynb....183
Summary....184
Chapter 5: Data Cleaning Tasks....186
What is Data Cleaning?....187
Data Cleaning for Personal Titles....188
Data Cleaning in SQL....189
Replace NULL with 0....190
Replace NULL Values with the Average Value....190
Listing 5.1: replace_null_values.sql....190
Replace Multiple Values with a Single Value....192
Listing 5.2: reduce_values.sql....192
Handle Mismatched Attribute Values....193
Listing 5.3: type_mismatch.sql....193
Convert Strings to Date Values....195
Listing 5.4: str_to_date.sql....195
Data Cleaning from the Command Line (Optional)....197
Working with the sed Utility....197
Listing 5.5: delimiter1.txt....197
Listing 5.6: delimiter1.sh....197
Working with Variable Column Counts....199
Listing 5.7: variable_columns.csv....199
Listing 5.8: variable_columns.sh....199
Listing 5.9: variable_columns2.sh....200
Truncating Rows in CSV Files....201
Listing 5.10: variable_columns3.sh....201
Generating Rows with Fixed Columns with the awk Utility....202
Listing 5.11: FixedFieldCount1.sh....202
Listing 5.12: employees.txt....203
Listing 5.13: FixedFieldCount2.sh....203
Converting Phone Numbers....204
Listing 5.14: phone_numbers.txt....204
Listing 5.15: phone_numbers.sh....205
Converting Numeric Date Formats....206
Listing 5.16: dates.txt....207
Listing 5.17: dates.sh....207
Listing 5.18: dates2.sh....209
Converting Alphabetic Date Formats....211
Listing 5.19: dates2.txt....211
Listing 5.20: dates3.sh....211
Working with Date and Time Date Formats....213
Listing 5.21: date-times.txt....214
Listing 5.22: date-times-padded.sh....214
Working with Codes, Countries, and Cities....220
Listing 5.23: country_codes.csv....220
Listing 5.24: add_country_codes.sh....220
Listing 5.25: countries_cities.csv....221
Listing 5.26: split_countries_codes.sh....222
Listing 5.27: countries_cities2.csv....223
Listing 5.28: split_countries_codes2.sh....223
Data Cleaning on a Kaggle Dataset....226
Listing 5.29: convert_marketing.sh....226
Summary....229
Chapter 6: Data Visualization....230
What is Data Visualization?....230
Types of Data Visualization....231
What is Matplotlib?....232
Diagonal Lines in Matplotlib....232
Listing 6.1: diagonallines.py....232
A Colored Grid in Matplotlib....233
Listing 6.2: plotgrid2.py....233
Randomized Data Points in Matplotlib....234
Listing 6.3: lin_plot_reg.py....234
A Histogram in Matplotlib....235
Listing 6.4: histogram1.py....235
A Set of Line Segments in Matplotlib....236
Listing 6.5: line_segments.py....236
Plotting Multiple Lines in Matplotlib....237
Listing 6.6: plt_array2.py....237
Trigonometric Functions in Matplotlib....238
Listing 6.7: sincos.py....238
Display IQ Scores in Matplotlib....239
Listing 6.8: iq_scores.py....239
Plot a Best-Fitting Line in Matplotlib....240
Listing 6.9: plot_best_fit.py....240
The Iris Dataset in Sklearn....241
Listing 6.10: sklearn_iris1.py....241
Sklearn, Pandas, and the Iris Dataset....243
Listing 6.11: pandas_iris.py....243
Working with Seaborn....245
Features of Seaborn....246
Seaborn Built-in Datasets....246
Listing 6.12: seaborn_tips.py....246
The Iris Dataset in Seaborn....247
Listing 6.13: seaborn_iris.py....247
The Titanic Dataset in Seaborn....248
Listing 6.14: seaborn_titanic_plot.py....248
Extracting Data from the Titanic Dataset in Seaborn (1)....249
Listing 6.15: seaborn_titanic.py....249
Extracting Data from the Titanic Dataset in Seaborn (2)....251
Listing 6.16: seaborn_titanic2.py....251
Visualizing a Pandas Dataset in Seaborn....253
Listing 6.17: pandas_seaborn.py....253
Data Visualization in Pandas....255
Listing 6.18: pandas_viz1.py....255
What is Bokeh?....257
Listing 6.19: bokeh_trig.py....257
Summary....259
Appendix A: Working with Data....260
What are Datasets?....260
Data Preprocessing....261
Data Types....262
Preparing Datasets....263
Discrete Data vs. Continuous Data....263
“Binning” Continuous Data....264
Scaling Numeric Data via Normalization....265
Scaling Numeric Data via Standardization....266
What to Look for in Categorical Data....267
Mapping Categorical Data to Numeric Values....268
Working with Dates....270
Working with Currency....270
Missing Data, Anomalies, and Outliers....271
Missing Data....271
Anomalies and Outliers....271
Outlier Detection....272
What is Data Drift?....273
What is Imbalanced Classification?....274
What is SMOTE?....275
SMOTE Extensions....275
Analyzing Classifiers (Optional)....276
What is LIME?....276
What is ANOVA?....277
The Bias-Variance Trade-Off....277
Types of Bias in Data....279
Summary....280
Appendix B: Working with awk....282
The awk Command....283
Built-in Variables that Control awk....283
How Does the awk Command Work?....284
Aligning Text with the printf Statement....285
Listing B.1: columns2.txt....285
Listing B.2: AlignColumns1.sh....285
Conditional Logic and Control Statements....286
The while Statement....286
A for Loop in awk....287
Listing B.3: Loop.sh....287
A for Loop with a break Statement....288
The next and continue Statements....288
Deleting Alternate Lines in Datasets....289
Listing B.4: linepairs.csv....289
Listing B.5: deletelines.sh....289
Merging Lines in Datasets....289
Listing B.6: columns.txt....289
Listing B.7: ColumnCount1.sh....290
Printing File Contents as a Single Line....290
Joining Groups of Lines in a Text File....291
Listing B.8: digits.txt....291
Listing B.9: digits.sh....291
Joining Alternate Lines in a Text File....291
Listing B.10: columns2.txt....291
Listing B.11: JoinLines.sh....292
Listing B.12: JoinLines2.sh....292
Listing B.13: JoinLines2.sh....292
Matching with Meta Characters and Character Sets....293
Listing B.14: Patterns1.sh....293
Listing B.15: columns3.txt....293
Listing B.16: MatchAlpha1.sh....293
Printing Lines Using Conditional Logic....294
Listing B.17: products.txt....294
Splitting Filenames with awk....295
Listing B.18: SplitFilename2.sh....295
Working with Postfix Arithmetic Operators....295
Listing B.19: mixednumbers.txt....295
Listing B.20: AddSubtract1.sh....295
Numeric Functions in awk....296
One-line awk Commands....299
Useful Short awk Scripts....300
Listing B.21: data.txt....300
Printing the Words in a Text String in awk....301
Listing B.22: Fields2.sh....301
Count Occurrences of a String in Specific Rows....301
Listing B.23: data1.csv....302
Listing B.24: data2.csv....302
Listing B.25: checkrows.sh....302
Printing a String in a Fixed Number of Columns....303
Listing B.26: FixedFieldCount1.sh....303
Printing a Dataset in a Fixed Number of Columns....303
Listing B.27: VariableColumns.txt....303
Listing B.28: Fields3.sh....303
Aligning Columns in Datasets....304
Listing B.29: mixed-data.csv....304
Listing B.30: mixed-data.sh....304
Aligning Columns and Multiple Rows in Datasets....305
Listing B.31: mixed-data2.csv....305
Listing B.32: aligned-data2.csv....306
Listing B.33: mixed-data2.sh....306
Removing a Column from a Text File....306
Listing B.34: VariableColumns.txt....307
Listing B.35: RemoveColumn.sh....307
Subsets of Column-aligned Rows in Datasets....307
Listing B.36: sub-rows-cols.txt....307
Listing B.37: sub-rows-cols.sh....307
Counting Word Frequency in Datasets....308
Listing B.38: WordCounts1.sh....309
Listing B.39: WordCounts2.sh....309
Listing B.40: columns4.txt....310
Displaying Only “Pure” Words in a Dataset....310
Listing B.41: onlywords.sh....310
Working with Multi-line Records in awk....312
Listing B.42: employees.txt....312
Listing B.43: employees.sh....312
A Simple Use Case....313
Listing B.44: quotes3.csv....313
Listing B.45: delim1.sh....313
Another Use Case....315
Listing B.46: dates2.csv....315
Listing B.47: string2date2.sh....315
Summary....316
Index....318
As part of the best-selling Pocket Primer series, this book is designed to provide a thorough introduction to numerous Python tools for data scientists. The book covers features of NumPy and Pandas, how to write regular expressions, and how to perform data cleaning tasks. It includes separate chapters on data visualization and working with Sklearn and SciPy. Companion files with source code are available.