Data Mining Algorithms in C++: Data Patterns and Algorithms for Modern Applications

Data Mining Algorithms in C++: Data Patterns and Algorithms for Modern Applications

Data Mining Algorithms in C++: Data Patterns and Algorithms for Modern Applications
Автор: Masters Timothy
Дата выхода: 2018
Издательство: Apress Media, LLC.
Количество страниц: 296
Размер файла: 2.4 MB
Тип файла: PDF
Добавил: codelibs
 Проверить на вирусы  Дополнительные материалы 

Table of Contents....4

About the Author....8

About the Technical Reviewers....9

Introduction....10

Chapter 1: Information and Entropy....14

Entropy....14

Entropy of a Continuous Random Variable....18

Partitioning a Continuous Variable for Entropy....18

An Example of Improving Entropy....23

Joint and Conditional Entropy....25

Code for Conditional Entropy....29

Mutual Information....30

Fano’s Bound and Selection of Predictor Variables....32

Confusion Matrices and Mutual Information....34

Extending Fano’s Bound for Upper Limits....36

Simple Algorithms for Mutual Information....40

The TEST_DIS Program....47

Continuous Mutual Information....49

The Parzen Window Method....50

Adaptive Partitioning....58

The TEST_CON Program....73

Asymmetric Information Measures....74

Uncertainty Reduction....74

Transfer Entropy: Schreiber’s Information Transfer....78

Chapter 2: Screening for Relationships....87

Simple Screening Methods....87

Univariate Screening....88

Bivariate Screening....88

Forward Stepwise Selection....88

Forward Selection Preserving Subsets....89

Backward Stepwise Selection....89

Criteria for a Relationship....89

Ordinary Correlation....90

Nonparametric Correlation....91

Accommodating Simple Nonlinearity....94

Chi-Square and Cramer’s V....97

Mutual Information and Uncertainty Reduction....100

Multivariate Extensions....100

Permutation Tests....101

A Modestly Rigorous Statement of the Procedure....101

A More Intuitive Approach....103

Serial Correlation Can Be Deadly....105

Permutation Algorithms....105

Outline of the Permutation Test Algorithm....106

Permutation Testing for Selection Bias....107

Combinatorially Symmetric Cross Validation....109

The CSCV Algorithm....114

An Example of CSCV OOS Testing....121

Univariate Screening for Relationships....122

Three Simple Examples....126

Bivariate Screening for Relationships....128

Stepwise Predictor Selection Using Mutual Information....136

Maximizing Relevance While Minimizing Redundancy....137

Code for the Relevance Minus Redundancy Algorithm....140

An Example of Relevance Minus Redundancy....144

A Superior Selection Algorithm for Binary Variables....148

FREL for High-Dimensionality, Small Size Datasets....153

Regularization....157

Interpreting Weights....158

Bootstrapping FREL....158

Monte Carlo Permutation Tests of FREL....159

General Statement of the FREL Algorithm....161

Multithreaded Code for FREL....165

Some FREL Examples....176

Chapter 3: Displaying Relationship Anomalies....179

Marginal Density Product....183

Actual Density....183

Marginal Inconsistency....183

Mutual Information Contribution....184

Code for Computing These Plots....185

Comments on Showing the Display....195

Chapter 4: Fun with Eigenvectors....197

Eigenvalues and Eigenvectors....198

Principal Components (If You Really Must)....200

The Factor Structure Is More Interesting....201

A Simple Example....202

Rotation Can Make Naming Easier....204

Code for Eigenvectors and Rotation....206

Eigenvectors of a Real Symmetric Matrix....206

Factor Structure of a Dataset....208

Varimax Rotation....211

Horn’s Algorithm for Determining Dimensionality....214

Code for the Modified Horn Algorithm....215

Clustering Variables in a Subspace....225

Code for Clustering Variables....229

Separating Individual from Common Variance....233

Log Likelihood the Slow, Definitional Way....240

Log Likelihood the Fast, Intelligent Way....242

The Basic Expectation Maximization Algorithm....244

Code for Basic Expectation Maximization....246

Accelerating the EM Algorithm....249

Code for Quadratic Acceleration with DECME-2s....253

Putting It All Together....258

Thoughts on My Version of the Algorithm....269

Measuring Coherence....269

Code for Tracking Coherence....272

Coherence in the Stock Market....276

Chapter 5: Using the DATAMINE Program....278

File/Read Data File....278

File/Exit....279

Screen/Univariate Screen....279

Screen/Bivariate Screen....280

Screen/Relevance Minus Redundancy....282

Screen/FREL....283

Analyze/Eigen Analysis....285

Analyze/Factor Analysis....285

Analyze/Rotate....286

Analyze/Cluster Variables....287

Analyze/Coherence....287

Plot/Series....288

Plot/Histogram....288

Plot/Density....288

Index....291

Data mining is a broad, deep, and frequently ambiguous field. Authorities don’t even agree on a definition for the term. What I will do is tell you how I interpret the term, especially as it applies to this book. But first, some personal history that sets the background for this book…

I’ve been blessed to work as a consultant in a wide variety of fields, enjoying rare diversity in my work. Early in my career, I developed computer algorithms that examined high-altitude photographs in an attempt to discover useful things. How many bushels of wheat can be expected from Midwestern farm fields this year? Are any of those fields showing signs of disease? How much water is stored in mountain ice packs? Is that anomaly a disguised missile silo? Is it a nuclear test site?

Eventually I moved on to the medical field and then finance: Does this photomicrograph of a tissue slice show signs of malignancy? Do these recent price movements presage a market collapse?

All of these endeavors have something in common: they all require that we find variables that are meaningful in the context of the application. These variables might address specific tasks, such as finding effective predictors for a prediction model. Or the variables might address more general tasks such as unguided exploration, seeking unexpected relationships among variables—relationships that might lead to novel approaches to solving the problem.

That, then, is the motivation for this book. I have taken some of my most-used techniques, those that I have found to be especially valuable in the study of relationships among variables, and documented them with basic theoretical foundations and wellcommented C++ source code. Naturally, this collection is far from complete. Maybe Volume 2 will appear someday. But this volume should keep you busy for a while

You may wonder why I have included a few techniques that are widely available in standard statistical packages, namely, very old techniques such as maximum likelihood factor analysis and varimax rotation. In these cases, I included them because they are useful, and yet reliable source code for these techniques is difficult to obtain. There are times when it’s more convenient to have your own versions of old workhorses, integrated xii into your own personal or proprietary programs, than to be forced to coexist with canned packages that may not fetch data or present results in the way that you want.


Похожее:

Список отзывов:

Нет отзывов к книге.