1. Training Data Introduction
Training Data Intents
What Can You Do With Training Data?
What is Training Data Most Concerned With?
Training Data Opportunities
Business Transformation
Training Data Efficiency
Tooling Proficiency
Common Pain Points
Why Training Data Matters
ML Applications are Becoming Mainstream
The Foundation of Successful AI
Training Data is Here to Stay
Training Data Controls the ML Program
New Types of Users
Training Data in the Wild
What Makes Training Data Difficult?
The Art of Supervising Machines
A New Thing
Media Types
ML Program Ecosystem
Data-Centric Machine Learning
Failures
Failing to Achieve the Desired Bias
What Training Data Is Not
Summary
2. Getting Up and Running
Introduction
Getting Up and Running
Installation
Annotation Setup
End User Setup
Data Setup
Workflow Setup
Data Catalog Setup
Initial Usage
Optimization
Tools Overview
Annotation
Catalog
Workflow
Training Data for Machine Learning
Growing Selection of Tools
People, Process, and Data
Embedded
Best Practices and Levels of Competency
Human Computer Supervision
Separation of End Concerns
Standards
Expansive Tooling
A Paradigm to Deliver Machine Learning Software
Trade-offs
Costs
Installed vs Software as a Service
Development System
Scale
Installation Options
Annotation Interfaces
Modeling Integration
Multi-User vs Single User
Integrations
Scope
Hidden Assumptions
Security
Open Source and Closed Source
History
Open Source Standards
Realizing the Need for Dedicated Tooling
Suite
Summary
3. Schema
Schema Deep Dive Introduction
Labels and Attributes
What Do We Care About?
Introduction to Labels
Attributes Introduction
Relationship to Spatial Types
Importance of What It Is
Technical Specifications
Where Is It? - Spatial Representation
Computer Vision Spatial Types
Lines and Curves
Types with Multiple Uses
Complex Spatial Types
Trade Offs with Types for Architecture and Creation
Trade Offs with Types for Usage
When Is It? - Relationships, Sequences, Time Series
Sequences and Relationships
When
Guides, Instructions
Judgment Calls
Choosing Good Names
Relation of Machine Learning Tasks to Training Data
Tasks
Chart - Relationship of Tasks to Training Data Types
General Concepts
Instance Concept Refresher
Upgrading Data Over Time
The Boundary Between Modeling and Training Data
Raw Data Concepts
Summary
4. Data Engineering
Introduction
Who Wants The Data?
A Game of Telephone
Planning A Great System
Naive & Training Data Centric approaches
Raw Data Storage
By Reference or by Value
Off-the-shelf dedicated Training Data tooling on your own hardware
Data storage
Where does the data rest?
Bucket connection
Raw Media (BLOB) Type Specific
Formatting & Mapping
User Defined Types (Compound Files)
Defining DataMaps
Ingest Wizards
Organizing Data and Useful Storage
Remote Storage
Versioning
Data Access
Disambiguating Storage, Ingestion, Export, and Access
File Based Exports
Streaming Data
Queries Introduction
Integrations with Ecosystem
Security
Access Control
Signed URLs
Pre-Label
Updating Data
Pre-Label Gotchas
Pre-Label data prep process
5. Annotation Automation
Introduction
Getting Started
Motivation: When to use these methods?
What do people actually use?
What kind of results can I expect?
Common Confusions
Risks
Costs Expected
Pre-Labeling
Standard Pre-Labeling
Micro Model Pre-Label
Quality Assurance Pre-Labeling
How to get started Pre-Labeling
Interactive Annotation Automation
Introduction
Interactive on Drawing Warm up
Interactive Capturing of a Region of Interest
Interactive Drawing Box to Polygon Using Grabcut
Full Image Model Prediction Example
How to get started with Interactive
Quality Assurance (QA) Automation
Using the Model to Debug The Humans
Automated Checklist Example
Checks based on looking at the data of samples
Data Discovery - What to Label Exploration
Choosing Based on Data
Choosing Based on MetaData
Simulation & Synthetic Data
Simulations are not perfect - Training Data still needs human review
Media Specific
What methods work with which media?
Video Specific
Polygon and Segmentation Specific
Language (NLP) Specific
Augmentation
Better Models are Better than Better Augmentation
To Augment or Not To Augment
Domain Specific
Geometry Based Labeling
Heuristic Based Labeling
6. Tools
Introduction
Why Training Data Tools
What do Training Data Tools Do?
Best practices and levels of competency
Human Computer Supervision
Tools Bring Clarity
Understanding the Importance of Tooling
Realizing the Need for Dedicated Tooling
More Usage, More Demands
Advent of New Standards
Journey to the Suite
Open Source Standards
A paradigm to deliver machine learning software
Scale
Why is it useful to define scale?
Rules of Thumb
Transitioning from small to medium scale
Build, Buy, or Customize
Major Scale Thoughts
Scope
Point Solutions
Tools in between
Platforms and Suites
Where is the Machine Learning?
Tooling quickstart
#1 Choose an open source tool to get up and running quickly.
#2 Try multiple, choose only one
#3 Use UI based wizards as much as possible.
Training Data Tooling Hidden Assumptions
True: Meet the Team
True: You have someone technical on your team
True: You have an ongoing project
True: You have a budget
True: You have time
False: You must use Graphics Processing Units GPUs
False: You must use automations
False: It’s all about the annotation UI
Security
Security Architecture
Attack Surface
Data Access
Human Access
Identity Access Management (IAM) bucket delegation schemes
In contrast with an installed solution
Annotator Access
Data Science Access
Root Level Access
Open Source and Closed Source
Deployment
Client Installed Deployment vs Software as a Service
Costs
Annotation Interfaces
User Experiences
Modeling Integration
Multi-User vs Single User
Integrations
Ease of Use
Annotator Ease of Use
Ergonomics of Labeling
Installation and organization
Docker
Docker Compose
Kubernetes
Configuration Choices
Storing Individual Frames (Video Specific)
Versioning Resolution
Retention Period
Bias in training data
The technical concept of Bias
This isn’t your grandfather’s Bias
Desirable Bias
Bias is hard to escape
Metadata
Lost Metadata
7. AI Transformation
AI Transformation Introduction
Getting Started
Seeing your Day to Day Work As Annotation
The Creative Revolution of Data Centric AI
The critical realization: you can create new data
You can change what data you collect
You can change the meaning of the data
You can create!
Think Step Function Improvement
Appoint a Leader: a Director of Training Data
Go From a Work Pool to Standard Expectation for All
Sometimes Proposals and Corrections, Sometimes Replacement
Upstream Producers and Downstream Consumers
Reading this Chart
Spectrum of Training Data Team Engagement
Dedicated Producers and Other Teams
Organizing Producers from Other Teams
Securing your AI Future
Use Case Discovery
Rubric for Good Use Cases
Evaluating Use Case Against the Rubric
Conceptual Effects of Use Cases
Rethink AI Annotation Talent - quality over quantity
Key Levers on Training Data ROI
Let’s think about what the Annotated Data Represents
Benefits of controlling your own training data
The Need for Hardware
Common Project Mistakes
Adopt Modern Training Data Tools
Business Models
Think Learning Curve not Perfection
New Training and Knowledge are Required
Producing And Consuming Training Data
Trap to Avoid: Premature Optimization in Training Data
No Silver Bullets
Culture of Training Data
New Engineering Principles
About the Author
Your training data has as much to do with the success of your data project as the algorithms themselves, because most failures in AI systems relate to training data. But while training data is the foundation for successful AI and machine learning, there are few comprehensive resources to help you ace the process.
In this hands-on guide, author Anthony Sarkis, lead engineer for the Diffgram AI training data software, shows technical professionals, managers, and subject matter experts how to work with and scale training data, while illuminating the human side of supervising machines. Engineering leaders, data engineers, and data science professionals alike will gain a solid understanding of the concepts, tools, and processes they need to succeed with training data.