Data Without Labels....1
Praise for Data Without Labels....3
brief contents....8
contents....9
foreword....16
preface....18
acknowledgments....20
about this book....22
Who should read this book....22
How this book is organized: A road map....23
About the code....23
liveBook discussion forum....23
about the author....25
about the cover illustration....26
Part 1 Basics....27
1 Introduction to machine learning....29
1.1 Technical toolkit....30
1.2 Data, data types, data management, and quality....31
1.2.1 What is data?....31
1.2.2 Various types of data....32
1.2.3 Data quality....35
1.2.4 Data engineering and management....37
1.3 Data analysis, ML, AI, and business intelligence....38
1.4 Nuts and bolts of ML....40
1.5 Types of ML algorithms....43
1.5.1 Supervised learning....44
1.5.2 Unsupervised algorithms....50
1.5.3 Semisupervised algorithms....54
1.5.4 Reinforcement learning....54
1.6 Concluding thoughts....55
Summary....56
2 Clustering techniques....58
2.1 Technical toolkit....59
2.2 Clustering....60
2.3 Centroid-based clustering....63
2.3.1 K-means clustering....65
2.3.2 Measuring the accuracy of clustering....68
2.3.3 Finding the optimum value of k....69
2.3.4 Pros and cons of k-means clustering....70
2.3.5 K-means clustering implementation using Python....72
2.4 Connectivity-based clustering....76
2.4.1 Types of hierarchical clustering....78
2.4.2 Linkage criterion for distance measurement....79
2.4.3 Optimal number of clusters....80
2.4.4 Pros and cons of hierarchical clustering....82
2.4.5 Hierarchical clustering case study using Python....83
2.5 Density-based clustering....86
2.5.1 Neighborhood and density....86
2.5.2 DBSCAN clustering....88
2.6 Case study using clustering....93
2.6.1 Business context....94
2.6.2 Dataset for the analysis....95
2.6.3 Suggested solutions....96
2.6.4 Solution for the problem....96
2.7 Common challenges faced in clustering....98
2.8 Concluding thoughts....100
2.9 Practical next steps and suggested readings....100
Summary....101
3 Dimensionality reduction....103
3.1 Technical toolkit....104
3.2 The curse of dimensionality....104
3.3 Dimension reduction methods....108
3.3.1 Mathematical foundation....108
3.4 Manual methods of dimensionality reduction....108
3.4.1 Manual feature selection....109
3.4.2 Correlation coefficient....110
3.4.3 Algorithm-based methods for reducing dimensions....111
3.5 Principal component analysis....111
3.5.1 Eigenvalue decomposition....116
3.5.2 Python solution using PCA....117
3.6 Singular value decomposition....123
3.6.1 Python solution using SVD....124
3.7 Pros and cons of dimensionality reduction....127
3.8 Case study for dimension reduction....129
3.9 Concluding thoughts....132
3.10 Practical next steps and suggested readings....132
Summary....133
Part 2 Intermediate level....135
4 Association rules....137
4.1 Technical toolkit....138
4.2 Association rule overview....138
4.3 The building blocks of association rules....140
4.3.1 Support, confidence, lift, and conviction....141
4.4 Apriori algorithm....145
4.4.1 Python implementation....147
4.4.2 Challenges with the Apriori algorithm....151
4.5 Equivalence class clustering and bottom-up lattice traversal....152
4.5.1 Python implementation....155
4.6 FP-growth algorithm....156
4.7 Sequence rule mining....163
4.7.1 Sequential Pattern Discovery Using Equivalence classes....164
4.8 Case study for association rules....168
4.9 Concluding thoughts....171
4.10 Practical next steps and suggested readings....173
Summary....173
5 Clustering....175
5.1 Technical toolkit....176
5.2 Clustering: A brief recap....176
5.3 Spectral clustering....177
5.3.1 Building blocks of spectral clustering....179
5.3.2 The process of spectral clustering....182
5.4 Python implementation of spectral clustering....184
5.5 Fuzzy clustering....186
5.5.1 Types of fuzzy clustering....187
5.5.2 Python implementation of FCM....190
5.6 Gaussian mixture model....193
5.6.1 EM technique....195
5.6.2 Python implementation of GMM....197
5.7 Concluding thoughts....200
5.8 Practical next steps and suggested readings....200
Summary....201
6 Dimensionality reduction....202
6.1 Technical toolkit....203
6.2 Multidimensional scaling....203
6.2.1 Classic MDS....205
6.2.2 Nonmetric MDS....206
6.3 Python implementation of MDS....210
6.4 t-distributed stochastic neighbor embedding....215
6.4.1 Cauchy distribution....217
6.4.2 Python implementation of t-SNE....219
6.5 Uniform manifold approximation and projection....222
6.5.1 Working with UMAP....223
6.5.2 Using UMAP....223
6.5.3 Key points of UMAP....224
6.6 Case study....224
6.7 Concluding thoughts....226
6.8 Practical next steps and suggested readings....226
Summary....227
7 Unsupervised learning for text data....228
7.1 Technical toolkit....229
7.2 Text data is everywhere....229
7.3 Use cases of text data....230
7.4 Challenges with text data....231
7.5 Preprocessing the text data....233
7.6 Data cleaning....233
7.7 Extracting features from the text dataset....235
7.8 Tokenization....236
7.9 BOW approach....237
7.10 Term frequency and inverse document frequency....239
7.11 Language models....240
7.12 Text cleaning using Python....242
7.13 Word embeddings....245
7.14 Word2Vec and GloVe....247
7.15 Sentiment analysis case study with Python implementation....248
7.16 Text clustering using Python....254
7.17 GenAI for text data....256
7.18 Concluding thoughts....256
7.19 Practical next steps and suggested readings....257
Summary....258
Part 3 Advanced concepts....259
8 Deep learning: The foundational concepts....261
8.1 Technical toolkit....262
8.1.1 Deep learning: What is it? What does it do?....262
8.2 Building blocks of a neural network....264
8.2.1 Neural networks for solutions....264
8.2.2 Artificial neurons and perceptrons....265
8.2.3 Different layers in a network....267
8.2.4 Activation functions....269
8.2.5 Hyperparameters....271
8.2.6 Optimization functions....272
8.3 How does deep learning work in a supervised manner?....274
8.3.1 Supervised learning algorithms....274
8.3.2 Step 1: Feed-forward propagation....274
8.3.3 Step 2: Adding the loss function....275
8.3.4 Step 3: Calculating the error....276
8.4 Backpropagation....276
8.4.1 The mathematics behind backpropagation....277
8.4.2 Step 4: Optimization....279
8.5 How deep learning works in an unsupervised manner....279
8.6 Convolutional neural networks....280
8.6.1 Key concepts of CNN....280
8.6.2 Use of CNN....282
8.7 Recurrent neural networks....282
8.7.1 Key concepts of RNN....282
8.8 Boltzmann learning rule....284
8.8.1 Concepts of the Boltzmann learning rule....284
8.8.2 Key points....285
8.9 Deep belief networks....285
8.9.1 Key points of DBN....285
8.10 Popular deep learning libraries....287
8.10.1 Python code for Keras and TF....288
8.11 Concluding thoughts....289
8.12 Practical next steps and suggested readings....290
Summary....291
9 Autoencoders....293
9.1 Technical toolkit....293
9.2 Feature learning....294
9.3 Introducing autoencoders....294
9.4 Components of autoencoders....295
9.5 Training of autoencoders....296
9.6 Application of autoencoders....297
9.7 Types of autoencoders....297
9.8 Python implementation of autoencoders....301
9.9 Concluding thoughts....303
9.10 Practical next steps and suggested readings....303
Summary....304
10 Generative adversarial networks, generative AI, and ChatGPT....305
10.1 AI: A transformation....305
10.2 GenAI and its significance....306
10.3 Discriminative models and GenAI....308
10.4 Generative adversarial networks....309
10.4.1 The generator network....309
10.4.2 The discriminator network....310
10.4.3 Adversarial training....311
10.4.4 Variants and applications of GANs....312
10.4.5 BERT, GPT-3, and others....312
10.5 ChatGPT and its details....313
10.5.1 Key features of ChatGPT....313
10.5.2 Applications of ChatGPT....313
10.6 Integration of GenAI....314
10.7 Concluding thoughts....315
10.8 Practical next steps and suggested readings....316
Summary....316
11 End-to-end model deployment....317
11.1 The machine learning modeling process....318
11.2 Business problem definition....318
11.3 Data discovery and feasibility analysis....320
11.4 Data cleaning and preparation....321
11.5 Duplicate values in the data....321
11.6 Categorical variables....322
11.7 Missing values in dataset....323
11.8 Outliers present in the data....325
11.9 Exploratory data analysis....325
11.10 Model development and business approval....326
11.11 Model deployment....326
11.12 Purpose of model deployment....326
11.13 Types of model deployment....327
11.14 Considerations while deploying the model....328
11.15 Documentation....329
11.16 Model maintenance and refresh....329
11.17 Concluding thoughts....330
11.18 Practical next steps and suggested readings....331
Summary....331
appendix A Mathematical foundations....333
A.1 List of clustering algorithms....333
A.1.1 Partitioning-based algorithms....333
A.1.2 Hierarchical clustering....333
A.1.3 Density-based algorithms....333
A.1.4 Grid-based algorithms....334
A.1.5 Model-based algorithms....334
A.1.6 Spectral clustering....334
A.1.7 Graph-based clustering....334
A.1.8 Subspace and high-dimensional clustering....335
A.1.9 Fuzzy and soft clustering....335
A.1.10 Constraint-based clustering....335
A.1.11 Evolutionary and genetic clustering....335
A.1.12 Neural network-based clustering....336
A.1.13 Other algorithms....336
A.2 What is a centroid?....336
A.3 L1 vs. L2 norm....336
A.4 Different scaling techniques used in the industry....336
A.5 Time complexity O(n)....337
A.6 How to install packages in Python....338
A.7 Correlation....338
A.7.1 Correlation coefficient....339
A.7.2 Uses of correlation....339
A.7.3 Important considerations....339
A.8 Time-series analysis....340
A.9 Mathematical foundation for data representation....340
A.9.1 Scalar and vector....341
A.9.2 Standard deviation and variance....341
A.9.3 Covariance and correlation....342
A.9.4 Matrix decomposition, eigenvectors, and eigenvalues....343
A.9.5 Special matrices....344
A.10 Hyperparameters vs. parameters....344
index....345
A....345
B....345
C....346
D....346
E....347
F....347
G....348
H....348
I....348
J....348
K....349
L....349
M....349
N....350
O....350
P....350
Q....351
R....351
S....351
T....351
U....352
V....352
W....352
X....352
Y....352
Z....352
Data Without Labels - back....354
Discover practical implementations of the key algorithms and models for handling unlabeled data. Full of case studies demonstrating how to apply each technique to real-world problems.
Data Without Labels introduces the mathematical techniques, key algorithms, and Python implementations that will help you build machine learning models for unannotated data. You’ll discover unsupervised machine learning approaches that can untangle raw, real-world datasets and support sound strategic decisions for your business.
Don’t get bogged down in theory: the book bridges the gap between complex math and practical Python implementations, covering end-to-end model development all the way through to production deployment. You’ll discover the business use cases for machine learning and unsupervised learning, with pointers to insightful research papers to round out your knowledge.
Generative AI, predictive algorithms, fraud detection, and many other analysis tasks rely on cheap and plentiful unlabeled data. Machine learning on data without labels—or unsupervised learning—turns raw text, images, and numbers into insights about your customers, accurate computer vision, and high-quality datasets for training AI models. This book will show you how.
Data Without Labels is a comprehensive guide to unsupervised learning, offering a deep dive into its mathematical foundations, algorithms, and practical applications. It presents practical examples from retail, aviation, and banking using fully annotated Python code. You’ll explore core techniques like clustering and dimensionality reduction along with advanced topics like autoencoders and GANs. As you go, you’ll learn where to apply unsupervised learning in business applications and discover how to develop your own machine learning models end-to-end.
Intended for data science professionals. Assumes knowledge of Python and basic machine learning.