Training Data for Machine Learning: Human Supervision from Annotation to Data Science

Author: Anthony Sarkis
Publication date: 2023
Publisher: O’Reilly Media, Inc.
Pages: 426
File size: 1.9 MB
File type: PDF
Added by: Aleks-5

1. Training Data Introduction

Training Data Intents

What Can You Do With Training Data?

What is Training Data Most Concerned With?

Training Data Opportunities

Business Transformation

Training Data Efficiency

Tooling Proficiency

Common Pain Points

Why Training Data Matters

ML Applications are Becoming Mainstream

The Foundation of Successful AI

Training Data is Here to Stay

Training Data Controls the ML Program

New Types of Users

Training Data in the Wild

What Makes Training Data Difficult?

The Art of Supervising Machines

A New Thing

Media Types

ML Program Ecosystem

Data-Centric Machine Learning

Failures

Failing to Achieve the Desired Bias

What Training Data Is Not

Summary

2. Getting Up and Running

Introduction

Getting Up and Running

Installation

Annotation Setup

End User Setup

Data Setup

Workflow Setup

Data Catalog Setup

Initial Usage

Optimization

Tools Overview

Annotation

Catalog

Workflow

Training Data for Machine Learning

Growing Selection of Tools

People, Process, and Data

Embedded

Best Practices and Levels of Competency

Human Computer Supervision

Separation of End Concerns

Standards

Expansive Tooling

A Paradigm to Deliver Machine Learning Software

Trade-offs

Costs

Installed vs Software as a Service

Development System

Scale

Installation Options

Annotation Interfaces

Modeling Integration

Multi-User vs Single User

Integrations

Scope

Hidden Assumptions

Security

Open Source and Closed Source

History

Open Source Standards

Realizing the Need for Dedicated Tooling

Suite

Summary

3. Schema

Schema Deep Dive Introduction

Labels and Attributes

What Do We Care About?

Introduction to Labels

Attributes Introduction

Relationship to Spatial Types

Importance of What It Is

Technical Specifications

Where Is It? - Spatial Representation

Computer Vision Spatial Types

Lines and Curves

Types with Multiple Uses

Complex Spatial Types

Trade Offs with Types for Architecture and Creation

Trade Offs with Types for Usage

When Is It? - Relationships, Sequences, Time Series

Sequences and Relationships

When

Guides, Instructions

Judgment Calls

Choosing Good Names

Relation of Machine Learning Tasks to Training Data

Tasks

Chart - Relationship of Tasks to Training Data Types

General Concepts

Instance Concept Refresher

Upgrading Data Over Time

The Boundary Between Modeling and Training Data

Raw Data Concepts

Summary

4. Data Engineering

Introduction

Who Wants The Data?

A Game of Telephone

Planning A Great System

Naive & Training Data Centric approaches

Raw Data Storage

By Reference or by Value

Off-the-shelf dedicated Training Data tooling on your own hardware

Data storage

Where does the data rest?

Bucket connection

Raw Media (BLOB) Type Specific

Formatting & Mapping

User Defined Types (Compound Files)

Defining DataMaps

Ingest Wizards

Organizing Data and Useful Storage

Remote Storage

Versioning

Data Access

Disambiguating Storage, Ingestion, Export, and Access

File Based Exports

Streaming Data

Queries Introduction

Integrations with Ecosystem

Security

Access Control

Signed URLs

Pre-Label

Updating Data

Pre-Label Gotchas

Pre-Label data prep process

5. Annotation Automation

Introduction

Getting Started

Motivation: When to use these methods?

What do people actually use?

What kind of results can I expect?

Common Confusions

Risks

Costs Expected

Pre-Labeling

Standard Pre-Labeling

Micro Model Pre-Label

Quality Assurance Pre-Labeling

How to get started Pre-Labeling

Interactive Annotation Automation

Introduction

Interactive on Drawing Warm up

Interactive Capturing of a Region of Interest

Interactive Drawing Box to Polygon Using Grabcut

Full Image Model Prediction Example

How to get started with Interactive

Quality Assurance (QA) Automation

Using the Model to Debug The Humans

Automated Checklist Example

Checks based on looking at the data of samples

Data Discovery - What to Label Exploration

Choosing Based on Data

Choosing Based on MetaData

Simulation & Synthetic Data

Simulations are not perfect - Training Data still needs human review

Media Specific

What methods work with which media?

Video Specific

Polygon and Segmentation Specific

Language (NLP) Specific

Augmentation

Better Models are Better than Better Augmentation

To Augment or Not To Augment

Domain Specific

Geometry Based Labeling

Heuristic Based Labeling

6. Tools

Introduction

Why Training Data Tools

What do Training Data Tools Do?

Best practices and levels of competency

Human Computer Supervision

Tools Bring Clarity

Understanding the Importance of Tooling

Realizing the Need for Dedicated Tooling

More Usage, More Demands

Advent of New Standards

Journey to the Suite

Open Source Standards

A paradigm to deliver machine learning software

Scale

Why is it useful to define scale?

Rules of Thumb

Transitioning from small to medium scale

Build, Buy, or Customize

Major Scale Thoughts

Scope

Point Solutions

Tools in between

Platforms and Suites

Where is the Machine Learning?

Tooling quickstart

#1 Choose an open source tool to get up and running quickly.

#2 Try multiple, choose only one

#3 Use UI based wizards as much as possible.

Training Data Tooling Hidden Assumptions

True: Meet the Team

True: You have someone technical on your team

True: You have an ongoing project

True: You have a budget

True: You have time

False: You must use Graphics Processing Units GPUs

False: You must use automations

False: It’s all about the annotation UI

Security

Security Architecture

Attack Surface

Data Access

Human Access

Identity Access Management (IAM) bucket delegation schemes

In contrast with an installed solution

Annotator Access

Data Science Access

Root Level Access

Open Source and Closed Source

Deployment

Client Installed Deployment vs Software as a Service

Costs

Annotation Interfaces

User Experiences

Modeling Integration

Multi-User vs Single User

Integrations

Ease of Use

Annotator Ease of Use

Ergonomics of Labeling

Installation and organization

Docker

Docker Compose

Kubernetes

Configuration Choices

Storing Individual Frames (Video Specific)

Versioning Resolution

Retention Period

Bias in training data

The technical concept of Bias

This isn’t your grandfather’s Bias

Desirable Bias

Bias is hard to escape

Metadata

Lost Metadata

7. AI Transformation

AI Transformation Introduction

Getting Started

Seeing your Day to Day Work As Annotation

The Creative Revolution of Data Centric AI

The critical realization: you can create new data

You can change what data you collect

You can change the meaning of the data

You can create!

Think Step Function Improvement

Appoint a Leader: a Director of Training Data

Go From a Work Pool to Standard Expectation for All

Sometimes Proposals and Corrections, Sometimes Replacement

Upstream Producers and Downstream Consumers

Reading this Chart

Spectrum of Training Data Team Engagement

Dedicated Producers and Other Teams

Organizing Producers from Other Teams

Securing your AI Future

Use Case Discovery

Rubric for Good Use Cases

Evaluating Use Case Against the Rubric

Conceptual Effects of Use Cases

Rethink AI Annotation Talent - quality over quantity

Key Levers on Training Data ROI

Let’s think about what the Annotated Data Represents

Benefits of controlling your own training data

The Need for Hardware

Common Project Mistakes

Adopt Modern Training Data Tools

Business Models

Think Learning Curve not Perfection

New Training and Knowledge are Required

Producing And Consuming Training Data

Trap to Avoid: Premature Optimization in Training Data

No Silver Bullets

Culture of Training Data

New Engineering Principles

About the Author

Your training data has as much to do with the success of your data project as the algorithms themselves because most failures in AI systems relate to training data. But while training data is the foundation for successful AI and machine learning, there are few comprehensive resources to help you ace the process.

In this hands-on guide, author Anthony Sarkis, lead engineer for the Diffgram AI training data software, shows technical professionals, managers, and subject matter experts how to work with and scale training data, while illuminating the human side of supervising machines. Engineering leaders, data engineers, and data science professionals alike will gain a solid understanding of the concepts, tools, and processes they need to succeed with training data.

With this book, you'll learn how to:

  • Work effectively with training data including schemas, raw data, and annotations
  • Transform your work, team, or organization to be more AI/ML data-centric
  • Clearly explain training data concepts to other staff, team members, and stakeholders
  • Design, deploy, and ship training data for production-grade AI applications
  • Recognize and correct new training-data-based failure modes such as data bias
  • Confidently use automation to more effectively create training data
  • Successfully maintain, operate, and improve training data systems of record
