Data Labeling in Machine Learning with Python: Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models

Data Labeling in Machine Learning with Python: Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models

Data Labeling in Machine Learning with Python: Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models
Автор: Suda Vijaya Kumar
Дата выхода: 2024
Издательство: Packt Publishing Limited
Количество страниц: 398
Размер файла: 5.8 MB
Тип файла: PDF
Добавил: codelibs
 Проверить на вирусы  Дополнительные материалы 

Cover....1

Title Page....2

Copyright....3

Acknowledgments....4

Contributors....5

Table of Contents....8

Preface....16

Part 1: Labeling Tabular Data....22

Chapter 1: Exploring Data for Machine Learning....24

Technical requirements....25

EDA and data labeling....26

Understanding the ML project life cycle....27

Defining the business problem....27

Data discovery and data collection....27

Data exploration....28

Data labeling....28

Model training....29

Model evaluation....29

Model deployment....29

Introducing Pandas DataFrames....29

Summary statistics and data aggregates....33

Summary statistics....34

Data aggregates of the feature for each target class....35

Creating visualizations using Seaborn for univariate and bivariate analysis....36

Univariate analysis....36

Bivariate analysis....42

Profiling data using the ydata-profiling library....46

Variables section....50

Interactions section....51

Correlations....52

Missing values....54

Sample data....56

Unlocking insights from data with OpenAI and LangChain....57

Summary....65

Chapter 2: Labeling Data for Classification....68

Technical requirements....68

Predicting labels with LLMs for tabular data....69

Data labeling using Snorkel....71

What is Snorkel?....72

Why is Snorkel popular?....73

Loading unlabeled data....74

Creating the labeling functions....74

Labeling rules....74

Constants....75

Labeling functions....75

Creating a label model....77

Predicting labels....78

Labeling data using the Compose library....79

Labeling data using semi-supervised learning....80

What is semi-supervised learning?....80

What is pseudo-labeling?....81

Labeling data using K-means clustering....83

What is unsupervised learning?....83

K-means clustering....84

Inertia....86

Dunn's index....86

Summary....87

Chapter 3: Labeling Data for Regression....88

Technical requirements....89

Using summary statistics to generate housing price labels....89

Finding the closest labeled observation to match the label....90

Using semi-supervised learning to label regression data....92

Pseudo-labeling....92

Using data augmentation to label regression data....95

Using k-means clustering to label regression data....99

Summary....103

Part 2: Labeling Image Data....104

Chapter 4: Exploring Image Data....106

Technical requirements....107

Visualizing image data using Matplotlib in Python....107

Loading the data....109

Checking the dimensions....109

Visualizing the data....109

Checking for outliers....109

Performing data preprocessing....109

Checking for class imbalance....111

Identifying patterns and relationships....112

Evaluating the impact of preprocessing....112

Practice example of visualizing data....112

Practice example for adding annotations to an image....118

Practice example of image segmentation....119

Practice example for feature extraction....121

Analyzing image size and aspect ratio....123

Impact of aspect ratios on model performance....123

Image resizing....125

Image normalization....130

Performing transformations on images – image augmentation....132

Summary....135

Chapter 5: Labeling Image Data Using Rules....136

Technical requirements....136

Labeling rules based on image visualization....137

Image labeling using rules with Snorkel....137

Weak supervision....137

Rules based on the manual visualization of an image’s object color....138

Real-world applications....139

A practical example of plant disease detection....141

Labeling images using rules based on properties....143

Bounding boxes....144

Example 1 – image classification – a bicycle with and without a person....145

Example 2 – image classification – dog and cat images....147

Labeling images using transfer learning....149

Example – digit classification using a pre-trained classifier....149

Example – person image detection using the YOLO V3 pre-trained classifier....152

Example – bicycle image detection using the YOLO V3 pre-trained classifier....153

Labeling images using transformations....153

Summary....154

Chapter 6: Labeling Image Data Using Data Augmentation....156

Technical requirements....156

Training support vector machines with augmented image data....157

Kernel trick....158

Data augmentation....158

Image data augmentation....158

Implementing an SVM with data augmentation in Python....160

Introducing the CIFAR-10 dataset....160

Loading the CIFAR-10 dataset in Python....160

Preprocessing the data for SVM training....161

Implementing an SVM with the default hyperparameters....162

Evaluating SVM on the original dataset....163

Implementing an SVM with an augmented dataset....163

Training the SVM on augmented data....164

Evaluating the SVM’s performance on the augmented dataset....164

Image classification using the SVM with data augmentation on the MNIST dataset....165

Convolutional neural networks using augmented image data....167

How CNNs work....167

Practical example of a CNN using data augmentation....169

CNN using image data augmentation with the CIFAR-10 dataset....174

Summary....177

Part 3: Labeling Text, Audio, and Video Data....180

Chapter 7: Labeling Text Data....182

Technical requirements....182

Real-world applications of text data labeling....183

Tools and frameworks for text data labeling....185

Exploratory data analysis of text....187

Loading the data....187

Understanding the data....187

Cleaning and preprocessing the data....187

Exploring the text’s content....188

Analyzing relationships between text and other variables....188

Visualizing the results....188

Exploratory data analysis of sample text data set....188

Exploring Generative AI and OpenAI for labeling text data....192

GPT models by OpenAI....193

Zero-shot learning capabilities....193

Text classification with OpenAI models....193

Data labeling assistance....193

OpenAI API overview....193

Use case 1 – summarizing the text....194

Use case 2 – topic generation for news articles....197

Use case 3 – classification of customer queries using the user-defined categories and sub-categories....197

Use case 4 – information retrieval using entity extraction....200

Use case 5 – aspect-based sentiment analysis....202

Hands-on labeling of text data using the Snorkel API....203

Hands-on text labeling using Logistic Regression....210

Hands-on label prediction using K-means clustering....213

Generating labels for customer reviews (sentiment analysis)....215

Summary....218

Chapter 8: Exploring Video Data....220

Technical requirements....221

Loading video data using cv2....221

Extracting frames from video data for analysis....222

Extracting features from video frames....223

Color histogram....223

Optical flow features....225

Motion vectors....225

Deep learning features....226

Appearance and shape descriptors....226

Visualizing video data using Matplotlib....227

Frame visualization....228

Temporal visualization....228

Motion visualization....229

Labeling video data using k-means clustering....230

Overview of data labeling using k-means clustering....231

Example of video data labeling using k-means clustering with a color histogram....231

Advanced concepts in video data analysis....235

Motion analysis in videos....235

Object tracking in videos....237

Facial recognition in videos....239

Video compression techniques....240

Real-time video processing....241

Video data formats and quality in machine learning....243

Common issues in handling video data for ML models....244

Troubleshooting steps....244

Summary....245

Chapter 9: Labeling Video Data....246

Technical requirements....247

Capturing real-time video....247

Key components and features....248

A hands-on example to capture real-time video using a webcam....248

Building a CNN model for labeling video data....249

Using autoencoders for video data labeling....255

A hands-on example to label video data using autoencoders....257

Transfer learning....264

Using the Watershed algorithm for video data labeling....266

A hands-on example to label video data segmentation using the Watershed algorithm....266

Computational complexity....271

Performance metrics....271

Real-world examples for video data labeling....272

Advances in video data labeling and classification....273

Summary....275

Chapter 10: Exploring Audio Data....276

Technical requirements....277

Real-life applications for labeling audio data....277

Audio data fundamentals....280

Hands-on with analyzing audio data....283

Example code for loading and analyzing sample audio file....283

Best practices for audio format conversion....286

Example code for audio data cleaning....287

Extracting properties from audio data....289

Tempo....289

Chroma features....290

Mel-frequency cepstral coefficients (MFCCs)....291

Zero-crossing rate....292

Spectral contrast....293

Considerations for extracting properties....294

Visualizing audio data with matplotlib and Librosa....294

Waveform visualization....295

Loudness visualization....296

Spectrogram visualization....297

Mel spectrogram visualization....298

Considerations for visualizations....301

Ethical implications of audio data....301

Recent advances in audio data analysis....303

Troubleshooting common issues during data analysis....304

Troubleshooting common installation issues for audio libraries....306

Summary....308

Chapter 11: Labeling Audio Data....310

Technical requirements....311

Downloading FFmpeg....312

Azure Machine Learning....312

Real-time voice classification with Random Forest....312

Transcribing audio using the OpenAI Whisper model....317

Step 1 – importing the Whisper model....318

Step 2 – loading the base Whisper model....318

Step 3 – setting up FFmpeg....319

Step 4 – transcribing the YouTube audio using the Whisper model....320

Classifying a transcription using Hugging Face transformers....321

Hands-on – labeling audio data using a CNN....321

Exploring audio data augmentation....328

Introducing Azure Cognitive Services – the speech service....333

Creating an Azure Speech service....333

Speech to text....334

Speech translation....335

Summary....337

Chapter 12: Hands-On Exploring Data Labeling Tools....338

Technical requirements....339

Azure Machine Learning data labeling....339

Label Studio....339

pyOpenAnnotate....339

Data labeling using Azure Machine Learning....339

Benefits of data labeling with Azure Machine Learning....340

Data labeling steps using Azure Machine Learning....340

Image data labeling with Azure Machine Learning....341

Text data labeling with Azure Machine Learning....355

Audio data labeling using Azure Machine Learning....363

Integration of the Azure Machine Learning pipeline with the labeled dataset....366

Exploring Label Studio....367

Labeling the image data....368

Labeling the text data....372

Labeling the video data....373

pyOpenAnnotate....374

Computer Vision Annotation Tool....376

Comparison of data labeling tools....377

Advanced methods in data labeling....378

Active learning....379

Semi-automated labeling....379

Summary....380

Index....382

About PACKT....394

Other Books You May Enjoy....395

Data labeling is the invisible hand that guides the power of artificial intelligence and machine learning. In today's data-driven world, mastering data labeling is not just an advantage, it's a necessity. Data Labeling in Machine Learning with Python empowers you to unearth value from raw data, create intelligent systems, and influence the course of technological evolution.

With this book, you'll discover the art of employing summary statistics, weak supervision, programmatic rules, and heuristics to assign labels to unlabeled training data programmatically. As you progress, you'll be able to enhance your datasets by mastering the intricacies of semi-supervised learning and data augmentation. Venturing further into the data landscape, you'll immerse yourself in the annotation of image, video, and audio data, harnessing the power of Python libraries such as seaborn, matplotlib, cv2, librosa, openai, and langchain. With hands-on guidance and practical examples, you'll gain proficiency in annotating diverse data types effectively.

By the end of this book, you'll have the practical expertise to programmatically label diverse data types and enhance datasets, unlocking the full potential of your data.

What you will learn

  • Excel in exploratory data analysis (EDA) for tabular, text, audio, video, and image data
  • Understand how to use Python libraries to apply rules to label raw data
  • Discover data augmentation techniques for adding classification labels
  • Leverage K-means clustering to classify unsupervised data
  • Explore how hybrid supervised learning is applied to add labels for classification
  • Master text data classification with generative AI
  • Detect objects and classify images with OpenCV and YOLO
  • Uncover a range of techniques and resources for data annotation

Who this book is for

This book is for machine learning engineers, data scientists, and data engineers who want to learn data labeling methods and algorithms for model training. Data enthusiasts and Python developers will be able to use this book to learn data exploration and annotation using Python libraries. Basic Python knowledge is beneficial but not necessary to get started.


Похожее:

Список отзывов:

Нет отзывов к книге.