Table of Contents....4
About the Author....8
About the Technical Reviewers....9
Introduction....10
Chapter 1: Information and Entropy....14
Entropy....14
Entropy of a Continuous Random Variable....18
Partitioning a Continuous Variable for Entropy....18
An Example of Improving Entropy....23
Joint and Conditional Entropy....25
Code for Conditional Entropy....29
Mutual Information....30
Fano’s Bound and Selection of Predictor Variables....32
Confusion Matrices and Mutual Information....34
Extending Fano’s Bound for Upper Limits....36
Simple Algorithms for Mutual Information....40
The TEST_DIS Program....47
Continuous Mutual Information....49
The Parzen Window Method....50
Adaptive Partitioning....58
The TEST_CON Program....73
Asymmetric Information Measures....74
Uncertainty Reduction....74
Transfer Entropy: Schreiber’s Information Transfer....78
Chapter 2: Screening for Relationships....87
Simple Screening Methods....87
Univariate Screening....88
Bivariate Screening....88
Forward Stepwise Selection....88
Forward Selection Preserving Subsets....89
Backward Stepwise Selection....89
Criteria for a Relationship....89
Ordinary Correlation....90
Nonparametric Correlation....91
Accommodating Simple Nonlinearity....94
Chi-Square and Cramer’s V....97
Mutual Information and Uncertainty Reduction....100
Multivariate Extensions....100
Permutation Tests....101
A Modestly Rigorous Statement of the Procedure....101
A More Intuitive Approach....103
Serial Correlation Can Be Deadly....105
Permutation Algorithms....105
Outline of the Permutation Test Algorithm....106
Permutation Testing for Selection Bias....107
Combinatorially Symmetric Cross Validation....109
The CSCV Algorithm....114
An Example of CSCV OOS Testing....121
Univariate Screening for Relationships....122
Three Simple Examples....126
Bivariate Screening for Relationships....128
Stepwise Predictor Selection Using Mutual Information....136
Maximizing Relevance While Minimizing Redundancy....137
Code for the Relevance Minus Redundancy Algorithm....140
An Example of Relevance Minus Redundancy....144
A Superior Selection Algorithm for Binary Variables....148
FREL for High-Dimensionality, Small Size Datasets....153
Regularization....157
Interpreting Weights....158
Bootstrapping FREL....158
Monte Carlo Permutation Tests of FREL....159
General Statement of the FREL Algorithm....161
Multithreaded Code for FREL....165
Some FREL Examples....176
Chapter 3: Displaying Relationship Anomalies....179
Marginal Density Product....183
Actual Density....183
Marginal Inconsistency....183
Mutual Information Contribution....184
Code for Computing These Plots....185
Comments on Showing the Display....195
Chapter 4: Fun with Eigenvectors....197
Eigenvalues and Eigenvectors....198
Principal Components (If You Really Must)....200
The Factor Structure Is More Interesting....201
A Simple Example....202
Rotation Can Make Naming Easier....204
Code for Eigenvectors and Rotation....206
Eigenvectors of a Real Symmetric Matrix....206
Factor Structure of a Dataset....208
Varimax Rotation....211
Horn’s Algorithm for Determining Dimensionality....214
Code for the Modified Horn Algorithm....215
Clustering Variables in a Subspace....225
Code for Clustering Variables....229
Separating Individual from Common Variance....233
Log Likelihood the Slow, Definitional Way....240
Log Likelihood the Fast, Intelligent Way....242
The Basic Expectation Maximization Algorithm....244
Code for Basic Expectation Maximization....246
Accelerating the EM Algorithm....249
Code for Quadratic Acceleration with DECME-2s....253
Putting It All Together....258
Thoughts on My Version of the Algorithm....269
Measuring Coherence....269
Code for Tracking Coherence....272
Coherence in the Stock Market....276
Chapter 5: Using the DATAMINE Program....278
File/Read Data File....278
File/Exit....279
Screen/Univariate Screen....279
Screen/Bivariate Screen....280
Screen/Relevance Minus Redundancy....282
Screen/FREL....283
Analyze/Eigen Analysis....285
Analyze/Factor Analysis....285
Analyze/Rotate....286
Analyze/Cluster Variables....287
Analyze/Coherence....287
Plot/Series....288
Plot/Histogram....288
Plot/Density....288
Index....291
Data mining is a broad, deep, and frequently ambiguous field. Authorities don’t even agree on a definition for the term. What I will do is tell you how I interpret the term, especially as it applies to this book. But first, some personal history that sets the background for this book…
I’ve been blessed to work as a consultant in a wide variety of fields, enjoying rare diversity in my work. Early in my career, I developed computer algorithms that examined high-altitude photographs in an attempt to discover useful things. How many bushels of wheat can be expected from Midwestern farm fields this year? Are any of those fields showing signs of disease? How much water is stored in mountain ice packs? Is that anomaly a disguised missile silo? Is it a nuclear test site?
Eventually I moved on to the medical field and then finance: Does this photomicrograph of a tissue slice show signs of malignancy? Do these recent price movements presage a market collapse?
All of these endeavors have something in common: they all require that we find variables that are meaningful in the context of the application. These variables might address specific tasks, such as finding effective predictors for a prediction model. Or the variables might address more general tasks such as unguided exploration, seeking unexpected relationships among variables—relationships that might lead to novel approaches to solving the problem.
That, then, is the motivation for this book. I have taken some of my most-used techniques, those that I have found to be especially valuable in the study of relationships among variables, and documented them with basic theoretical foundations and wellcommented C++ source code. Naturally, this collection is far from complete. Maybe Volume 2 will appear someday. But this volume should keep you busy for a while
You may wonder why I have included a few techniques that are widely available in standard statistical packages, namely, very old techniques such as maximum likelihood factor analysis and varimax rotation. In these cases, I included them because they are useful, and yet reliable source code for these techniques is difficult to obtain. There are times when it’s more convenient to have your own versions of old workhorses, integrated xii into your own personal or proprietary programs, than to be forced to coexist with canned packages that may not fetch data or present results in the way that you want.