Cover....1
Title Page....2
Copyright....3
Acknowledgments....4
Contributors....5
Table of Contents....8
Preface....16
Part 1: Labeling Tabular Data....22
Chapter 1: Exploring Data for Machine Learning....24
Technical requirements....25
EDA and data labeling....26
Understanding the ML project life cycle....27
Defining the business problem....27
Data discovery and data collection....27
Data exploration....28
Data labeling....28
Model training....29
Model evaluation....29
Model deployment....29
Introducing Pandas DataFrames....29
Summary statistics and data aggregates....33
Summary statistics....34
Data aggregates of the feature for each target class....35
Creating visualizations using Seaborn for univariate and bivariate analysis....36
Univariate analysis....36
Bivariate analysis....42
Profiling data using the ydata-profiling library....46
Variables section....50
Interactions section....51
Correlations....52
Missing values....54
Sample data....56
Unlocking insights from data with OpenAI and LangChain....57
Summary....65
Chapter 2: Labeling Data for Classification....68
Technical requirements....68
Predicting labels with LLMs for tabular data....69
Data labeling using Snorkel....71
What is Snorkel?....72
Why is Snorkel popular?....73
Loading unlabeled data....74
Creating the labeling functions....74
Labeling rules....74
Constants....75
Labeling functions....75
Creating a label model....77
Predicting labels....78
Labeling data using the Compose library....79
Labeling data using semi-supervised learning....80
What is semi-supervised learning?....80
What is pseudo-labeling?....81
Labeling data using K-means clustering....83
What is unsupervised learning?....83
K-means clustering....84
Inertia....86
Dunn's index....86
Summary....87
Chapter 3: Labeling Data for Regression....88
Technical requirements....89
Using summary statistics to generate housing price labels....89
Finding the closest labeled observation to match the label....90
Using semi-supervised learning to label regression data....92
Pseudo-labeling....92
Using data augmentation to label regression data....95
Using k-means clustering to label regression data....99
Summary....103
Part 2: Labeling Image Data....104
Chapter 4: Exploring Image Data....106
Technical requirements....107
Visualizing image data using Matplotlib in Python....107
Loading the data....109
Checking the dimensions....109
Visualizing the data....109
Checking for outliers....109
Performing data preprocessing....109
Checking for class imbalance....111
Identifying patterns and relationships....112
Evaluating the impact of preprocessing....112
Practice example of visualizing data....112
Practice example for adding annotations to an image....118
Practice example of image segmentation....119
Practice example for feature extraction....121
Analyzing image size and aspect ratio....123
Impact of aspect ratios on model performance....123
Image resizing....125
Image normalization....130
Performing transformations on images – image augmentation....132
Summary....135
Chapter 5: Labeling Image Data Using Rules....136
Technical requirements....136
Labeling rules based on image visualization....137
Image labeling using rules with Snorkel....137
Weak supervision....137
Rules based on the manual visualization of an image’s object color....138
Real-world applications....139
A practical example of plant disease detection....141
Labeling images using rules based on properties....143
Bounding boxes....144
Example 1 – image classification – a bicycle with and without a person....145
Example 2 – image classification – dog and cat images....147
Labeling images using transfer learning....149
Example – digit classification using a pre-trained classifier....149
Example – person image detection using the YOLO V3 pre-trained classifier....152
Example – bicycle image detection using the YOLO V3 pre-trained classifier....153
Labeling images using transformations....153
Summary....154
Chapter 6: Labeling Image Data Using Data Augmentation....156
Technical requirements....156
Training support vector machines with augmented image data....157
Kernel trick....158
Data augmentation....158
Image data augmentation....158
Implementing an SVM with data augmentation in Python....160
Introducing the CIFAR-10 dataset....160
Loading the CIFAR-10 dataset in Python....160
Preprocessing the data for SVM training....161
Implementing an SVM with the default hyperparameters....162
Evaluating SVM on the original dataset....163
Implementing an SVM with an augmented dataset....163
Training the SVM on augmented data....164
Evaluating the SVM’s performance on the augmented dataset....164
Image classification using the SVM with data augmentation on the MNIST dataset....165
Convolutional neural networks using augmented image data....167
How CNNs work....167
Practical example of a CNN using data augmentation....169
CNN using image data augmentation with the CIFAR-10 dataset....174
Summary....177
Part 3: Labeling Text, Audio, and Video Data....180
Chapter 7: Labeling Text Data....182
Technical requirements....182
Real-world applications of text data labeling....183
Tools and frameworks for text data labeling....185
Exploratory data analysis of text....187
Loading the data....187
Understanding the data....187
Cleaning and preprocessing the data....187
Exploring the text’s content....188
Analyzing relationships between text and other variables....188
Visualizing the results....188
Exploratory data analysis of sample text data set....188
Exploring Generative AI and OpenAI for labeling text data....192
GPT models by OpenAI....193
Zero-shot learning capabilities....193
Text classification with OpenAI models....193
Data labeling assistance....193
OpenAI API overview....193
Use case 1 – summarizing the text....194
Use case 2 – topic generation for news articles....197
Use case 3 – classification of customer queries using the user-defined categories and sub-categories....197
Use case 4 – information retrieval using entity extraction....200
Use case 5 – aspect-based sentiment analysis....202
Hands-on labeling of text data using the Snorkel API....203
Hands-on text labeling using Logistic Regression....210
Hands-on label prediction using K-means clustering....213
Generating labels for customer reviews (sentiment analysis)....215
Summary....218
Chapter 8: Exploring Video Data....220
Technical requirements....221
Loading video data using cv2....221
Extracting frames from video data for analysis....222
Extracting features from video frames....223
Color histogram....223
Optical flow features....225
Motion vectors....225
Deep learning features....226
Appearance and shape descriptors....226
Visualizing video data using Matplotlib....227
Frame visualization....228
Temporal visualization....228
Motion visualization....229
Labeling video data using k-means clustering....230
Overview of data labeling using k-means clustering....231
Example of video data labeling using k-means clustering with a color histogram....231
Advanced concepts in video data analysis....235
Motion analysis in videos....235
Object tracking in videos....237
Facial recognition in videos....239
Video compression techniques....240
Real-time video processing....241
Video data formats and quality in machine learning....243
Common issues in handling video data for ML models....244
Troubleshooting steps....244
Summary....245
Chapter 9: Labeling Video Data....246
Technical requirements....247
Capturing real-time video....247
Key components and features....248
A hands-on example to capture real-time video using a webcam....248
Building a CNN model for labeling video data....249
Using autoencoders for video data labeling....255
A hands-on example to label video data using autoencoders....257
Transfer learning....264
Using the Watershed algorithm for video data labeling....266
A hands-on example to label video data segmentation using the Watershed algorithm....266
Computational complexity....271
Performance metrics....271
Real-world examples for video data labeling....272
Advances in video data labeling and classification....273
Summary....275
Chapter 10: Exploring Audio Data....276
Technical requirements....277
Real-life applications for labeling audio data....277
Audio data fundamentals....280
Hands-on with analyzing audio data....283
Example code for loading and analyzing sample audio file....283
Best practices for audio format conversion....286
Example code for audio data cleaning....287
Extracting properties from audio data....289
Tempo....289
Chroma features....290
Mel-frequency cepstral coefficients (MFCCs)....291
Zero-crossing rate....292
Spectral contrast....293
Considerations for extracting properties....294
Visualizing audio data with matplotlib and Librosa....294
Waveform visualization....295
Loudness visualization....296
Spectrogram visualization....297
Mel spectrogram visualization....298
Considerations for visualizations....301
Ethical implications of audio data....301
Recent advances in audio data analysis....303
Troubleshooting common issues during data analysis....304
Troubleshooting common installation issues for audio libraries....306
Summary....308
Chapter 11: Labeling Audio Data....310
Technical requirements....311
Downloading FFmpeg....312
Azure Machine Learning....312
Real-time voice classification with Random Forest....312
Transcribing audio using the OpenAI Whisper model....317
Step 1 – importing the Whisper model....318
Step 2 – loading the base Whisper model....318
Step 3 – setting up FFmpeg....319
Step 4 – transcribing the YouTube audio using the Whisper model....320
Classifying a transcription using Hugging Face transformers....321
Hands-on – labeling audio data using a CNN....321
Exploring audio data augmentation....328
Introducing Azure Cognitive Services – the speech service....333
Creating an Azure Speech service....333
Speech to text....334
Speech translation....335
Summary....337
Chapter 12: Hands-On Exploring Data Labeling Tools....338
Technical requirements....339
Azure Machine Learning data labeling....339
Label Studio....339
pyOpenAnnotate....339
Data labeling using Azure Machine Learning....339
Benefits of data labeling with Azure Machine Learning....340
Data labeling steps using Azure Machine Learning....340
Image data labeling with Azure Machine Learning....341
Text data labeling with Azure Machine Learning....355
Audio data labeling using Azure Machine Learning....363
Integration of the Azure Machine Learning pipeline with the labeled dataset....366
Exploring Label Studio....367
Labeling the image data....368
Labeling the text data....372
Labeling the video data....373
pyOpenAnnotate....374
Computer Vision Annotation Tool....376
Comparison of data labeling tools....377
Advanced methods in data labeling....378
Active learning....379
Semi-automated labeling....379
Summary....380
Index....382
About PACKT....394
Other Books You May Enjoy....395
Data labeling is the invisible hand that guides the power of artificial intelligence and machine learning. In today's data-driven world, mastering data labeling is not just an advantage, it's a necessity. Data Labeling in Machine Learning with Python empowers you to unearth value from raw data, create intelligent systems, and influence the course of technological evolution.
With this book, you'll discover the art of employing summary statistics, weak supervision, programmatic rules, and heuristics to assign labels to unlabeled training data programmatically. As you progress, you'll be able to enhance your datasets by mastering the intricacies of semi-supervised learning and data augmentation. Venturing further into the data landscape, you'll immerse yourself in the annotation of image, video, and audio data, harnessing the power of Python libraries such as seaborn, matplotlib, cv2, librosa, openai, and langchain. With hands-on guidance and practical examples, you'll gain proficiency in annotating diverse data types effectively.
By the end of this book, you'll have the practical expertise to programmatically label diverse data types and enhance datasets, unlocking the full potential of your data.
This book is for machine learning engineers, data scientists, and data engineers who want to learn data labeling methods and algorithms for model training. Data enthusiasts and Python developers will be able to use this book to learn data exploration and annotation using Python libraries. Basic Python knowledge is beneficial but not necessary to get started.