Praise for Hacks, Leaks, and Revelations....12
Title Page....14
Copyright....15
Dedication....17
About the Author and Technical Reviewer....18
Acknowledgments....19
Introduction....20
Why I Wrote This Book....21
What You’ll Learn....22
What You’ll Need....26
Part I: Sources and Datasets....28
1. Protecting Sources and Yourself....29
Safely Communicating with Sources....30
Working with Public Data....31
Protecting Sensitive Information....31
Minimizing the Digital Trail....32
Working with Hackers and Whistleblowers....33
Secure Storage for Datasets....34
Low-Sensitivity Datasets....34
Medium-Sensitivity Datasets....35
High-Sensitivity Datasets....36
Authenticating Datasets....38
The AFLDS Dataset....39
The WikiLeaks Twitter Group Chat....40
Redaction....44
What Data to Publish....44
What to Redact....45
Making Requests for Comment....48
Password Managers....49
Disk Encryption....53
Exercise 1-1: Encrypt Your Internal Disk....54
Windows....55
macOS....57
Linux....57
Exercise 1-2: Encrypt a USB Disk....58
Windows....59
macOS....62
Linux....62
Protecting Yourself from Malicious Documents....63
Exercise 1-3: Install and Use Dangerzone....65
Summary....67
2. Acquiring Datasets....69
The End of WikiLeaks....70
Distributed Denial of Secrets....72
Downloading Datasets with BitTorrent....73
The Origins of BlueLeaks....75
Exercise 2-1: Download the BlueLeaks Dataset....76
Communicating with Encrypted Messaging Apps....77
Exercise 2-2: Install and Practice Using Signal....80
Encrypting Messages with PGP....80
Staying Anonymous Online with Tor and OnionShare....81
Exercise 2-3: Play with Tor and OnionShare....86
Communicating with My Tea Party Patriots Source....87
Other Options for Acquiring Datasets from Sources....88
Encrypted USB Drives....89
Virtual Private Servers....90
Whistleblower Submission Systems....91
Summary....92
Part II: Tools of the Trade....93
3. The Command Line Interface....94
Introducing the Command Line....95
The Shell....95
Users and Paths....96
User Privileges....97
Exercise 3-1: Install Ubuntu in Windows....99
Basic Command Line Usage....102
Opening a Terminal....103
Clearing Your Screen and Exiting the Shell....104
Exploring Files and Directories....104
Navigating Relative and Absolute Paths....107
Changing Directories....107
Using the help Argument....109
Accessing Man Pages....110
Tips for Navigating the Terminal....110
Entering Commands with Tab Completion....110
Editing Commands....112
Dealing with Spaces in Filenames....112
Using Single Quotes Around Double Quotes....114
Installing and Uninstalling Software with Package Managers....115
Exercise 3-2: Manage Packages with Homebrew on macOS....116
Exercise 3-3: Manage Packages with apt on Windows or Linux....119
Exercise 3-4: Practice Using the Command Line with cURL....122
Download a Web Page with cURL....122
Save a Web Page to a File....123
Text Files vs. Binary Files....124
Exercise 3-5: Install the VS Code Text Editor....125
Exercise 3-6: Write Your First Shell Script....127
Navigate to Your USB Disk....127
Create an Exercises Folder....128
Open a VS Code Workspace....129
Write the Shell Script....130
Run the Shell Script....131
Exercise 3-7: Clone the Book’s GitHub Repository....133
Summary....134
4. Exploring Datasets in the Terminal....135
Introducing for Loops....135
Exercise 4-1: Unzip the BlueLeaks Dataset....138
Unzip Files on macOS or Linux....138
Unzip Files on Windows....141
Organize Your Files....143
How the Hacker Obtained the BlueLeaks Data....144
Exercise 4-2: Explore BlueLeaks on the Command Line....146
Calculate How Much Disk Space Folders Use....146
Use Pipes and Sort Output....148
Create an Inventory of Filenames in a Dataset....150
Count the Files in a Dataset....151
Exercise 4-3: Find Revelations in BlueLeaks with grep....152
Filter for Documents Mentioning Antifa....152
Filter for Certain Types of Files....154
Use grep with Regular Expressions....155
Search Files in Bulk with grep....156
Encrypted Data in the BlueLeaks Dataset....159
Data Analysis with Servers in the Cloud....161
Exercise 4-4: Set Up a VPS....164
Generate an SSH Key....164
Add Your Public Key to the Cloud Provider....165
Create a VPS....166
SSH into Your Server....167
Start a Byobu Session....169
Install Updates....170
Exercise 4-5: Explore the Oath Keepers Dataset Remotely....170
Summary....177
5. Docker, Aleph, and Making Datasets Searchable....178
Introducing Docker and Linux Containers....179
Exercise 5-1: Initialize Docker Desktop on Windows and macOS....180
Exercise 5-2: Initialize Docker Engine on Linux....181
Running Containers with Docker....183
Running an Ubuntu Container....183
Listing and Killing Containers....185
Mounting and Removing Volumes....186
Passing Environment Variables....192
Running Server Software....192
Freeing Up Disk Space....195
Exercise 5-3: Run a WordPress Site with Docker Compose....196
Make a docker-compose.yaml File....196
Start Your WordPress Site....198
Introducing Aleph....200
Exercise 5-4: Run Aleph Locally in Linux Containers....201
Using Aleph’s Web and Command Line Interfaces....204
Indexing Data in Aleph....206
Exercise 5-5: Index a BlueLeaks Folder in Aleph....207
Mount Your Datasets into the Aleph Shell....207
Index the icefishx Folder....208
Check Indexing Status....209
Explore BlueLeaks with Aleph....212
Additional Aleph Features....214
Dedicated Aleph Servers....216
Summary....218
6. Reading Other People’s Email....219
The Email Protocol and Message Structure....220
File Formats for Email Dumps....222
EML Files....222
MBOX Files....223
PST Outlook Data Files....223
Exercise 6-1: Download Email Dumps from Three Datasets....224
The Nauru Police Force Dataset....224
The Oath Keepers Dataset....225
The Heritage Foundation Dataset....225
Researching Email Dumps with Thunderbird....226
Exercise 6-2: Configure Thunderbird for Email Dumps....227
Reading Individual EML Files with Thunderbird....228
Exercise 6-3: Import the Nauru Police Force EML Email Dump....229
Searching Email in Thunderbird....231
Quick Filter Searches....232
The Search Messages Dialog....232
Exercise 6-4: Import the Oath Keepers MBOX Email Dump....233
Exercise 6-5: Import the Heritage Foundation PST Email Dump....234
Other Tools for Researching Email Dumps....237
Microsoft Outlook....237
Aleph....240
Summary....241
Part III: Python Programming....243
7. An Introduction to Python....244
Exercise 7-1: Install Python....245
Windows....245
Linux....245
macOS....245
Exercise 7-2: Write Your First Python Script....246
Python Basics....247
The Interactive Python Interpreter....248
Comments....248
Math with Python....249
Strings....252
Exercise 7-3: Write a Python Script with Variables, Math, and Strings....253
Lists and Loops....255
Defining and Printing Lists....256
Running for Loops....259
Control Flow....261
Comparison Operators....262
if Statements....263
Nested Code Blocks....265
Searching Lists....266
Logical Operators....267
Exception Handling....269
Exercise 7-4: Practice Loops and Control Flow....272
Functions....275
The def Keyword....275
Default Arguments....277
Return Values....278
Docstrings....281
Exercise 7-5: Practice Writing Functions....282
Summary....283
8. Working with Data in Python....284
Modules....284
Python Script Template....287
Exercise 8-1: Traverse the Files in BlueLeaks....288
List the Filenames in a Folder....288
Count the Files and Folders in a Folder....290
Traverse Folders with os.walk()....292
Exercise 8-2: Find the Largest Files in BlueLeaks....294
Third-Party Modules....297
Exercise 8-3: Practice Command Line Arguments with Click....300
Avoiding Hardcoding with Command Line Arguments....302
Exercise 8-4: Find the Largest Files in Any Dataset....303
Dictionaries....305
Defining Dictionaries....305
Getting and Setting Values....306
Navigating Dictionaries and Lists in the Conti Chat Logs....308
Exploring Dictionaries and Lists Full of Data in Python....308
Selecting Values in Dictionaries and Lists....312
Analyzing Data Stored in Dictionaries and Lists....313
Exercise 8-5: Map Out the CSVs in BlueLeaks....318
Accept a Command Line Argument....319
Loop Through the BlueLeaks Folders....320
Fill Up the Dictionary....321
Display the Output....324
Reading and Writing Files....326
Opening Files....326
Writing Lines to a File....327
Reading Lines from a File....328
Exercise 8-6: Practice Reading and Writing Files....329
Summary....332
Part IV: Structured Data....333
9. BlueLeaks, Black Lives Matter, and the CSV File Format....334
Installing Spreadsheet Software....335
Introducing the CSV File Format....336
Exploring CSV Files with Spreadsheet Software and Text Editors....338
My BlueLeaks Investigation....342
Focusing on a Fusion Center....342
Introducing NCRIC....343
Investigating a SAR....343
Reading and Writing CSV Files in Python....349
Exercise 9-1: Make BlueLeaks CSVs More Readable....351
Accept the CSV Path as an Argument....352
Loop Through the CSV Rows....353
Display CSV Fields on Separate Lines....354
How to Read Bulk Email from Fusion Centers....357
Lists of Black Lives Matter Demonstrations....358
“Intelligence” Memos from the FBI and DHS....363
A Brief HTML Primer....365
Exercise 9-2: Make Bulk Email Readable....367
Accept the Command Line Arguments....368
Create the Output Folder....369
Define the Filename for Each Row....370
Write the HTML Version of Each Bulk Email....372
Discovering the Names and URLs of BlueLeaks Sites....379
Exercise 9-3: Make a CSV of BlueLeaks Sites....381
Open a CSV for Writing....382
Find All the Company.csv Files....383
Add BlueLeaks Sites to the CSV....385
Summary....388
10. BlueLeaks Explorer....389
Undiscovered Revelations in BlueLeaks....390
Exercise 10-1: Install BlueLeaks Explorer....391
Create the Docker Compose Configuration File....391
Bring Up the Containers....392
Initialize the Databases....393
The Structure of NCRIC....395
Exploring Tables and Relationships....396
Searching for Keywords....399
Building Your Own BlueLeaks Structure....400
Defining the JRIC Structure....401
Showing Useful Fields....404
Changing Field Types....407
Adding JRIC’s Leads Table....409
Building a Relationship....411
Verifying BlueLeaks Data....414
Exercise 10-2: Finish Building the Structure for JRIC....416
The Technology Behind BlueLeaks Explorer....417
The Backend....418
The Frontend....418
Summary....419
11. Parler, the January 6 Insurrection, and the JSON File Format....420
The Origins of the Parler Dataset....421
How the Parler Videos Were Archived....421
The Dataset’s Impact on Trump’s Second Impeachment....423
Exercise 11-1: Download and Extract Parler Video Metadata....424
Download the Metadata....424
Uncompress and Download Individual Parler Videos....426
Extract Parler Metadata....429
The JSON File Format....431
Understanding JSON Syntax....432
Parsing JSON with Python....435
Handling Exceptions with JSON....438
Tools for Exploring JSON Data....440
Counting Videos with GPS Coordinates Using grep....440
Formatting and Searching Data with the jq Command....442
Exercise 11-2: Write a Script to Filter for Videos with GPS from January 6, 2021....444
Accept the Parler Metadata Path as an Argument....445
Loop Through Parler Metadata Files....446
Filter for Videos with GPS Coordinates....448
Filter for Videos from January 6, 2021....450
Working with GPS Coordinates....452
Searching by Latitude and Longitude....452
Converting Between GPS Coordinate Formats....454
Calculating GPS Distance in Python....457
Finding the Center of Washington, DC....459
Exercise 11-3: Update the Script to Filter for Insurrection Videos....460
Plotting GPS Coordinates on a Map with simplekml....465
Exercise 11-4: Create KML Files to Visualize Location Data....467
Create a KML File for All Videos with GPS Coordinates....469
Create KML Files for Videos from January 6, 2021....473
Visualizing Location Data with Google Earth....476
Viewing Metadata with ExifTool....481
Summary....483
12. Epik Fail, Extremism Research, and SQL Databases....484
The Structure of SQL Databases....485
Relational Databases....486
Clients and Servers....487
Tables, Columns, and Types....489
Exercise 12-1: Create and Test a MySQL Server Using Docker and Adminer....490
Run the Server....490
Connect to the Database with Adminer....492
Create a Test Database....493
Exercise 12-2: Query Your SQL Database....494
INSERT Statements....495
SELECT Statements....497
JOIN Clauses....503
UPDATE Statements....508
DELETE Statements....508
Introducing the MySQL Command Line Client....509
Exercise 12-3: Install and Test the Command Line MySQL Client....510
MySQL-Specific Queries....512
The History of Epik....515
The Epik Hack....516
Epik’s WHOIS Data....518
Exercise 12-4: Download and Extract Part of the Epik Dataset....521
Exercise 12-5: Import Epik Data into MySQL....522
Create a Database for api_system....522
Import api_system Data....522
Exploring Epik’s SQL Database....524
The domain Table....525
The privacy Table....527
The hosting and hosting_server Tables....530
Working with Epik Data in the Cloud....532
Summary....535
Part V: Case Studies....537
13. Pandemic Profiteers and Covid-19 Disinformation....538
The Origins of AFLDS....540
The Cadence Health and Ravkoo Datasets....543
Extracting the Data into an Encrypted File Container....543
Analyzing the Data with Command Line Tools....545
Creating a Single Spreadsheet of Patients....554
Calculating Revenue from Prescriptions Filled by Ravkoo....560
Finding the Price and Quantity of Drugs Sold....560
Categorizing Prescription Data by Drug....564
A Deeper Look at the Cadence Health Patient Data....568
Finding Cadence’s Partners....568
Searching for Patients by City....572
Searching for Patients by Age....577
Authenticating the Data....582
The Aftermath....584
HIPAA’s Breach Notification Rule....585
Congressional Investigation....585
Simone Gold’s New Business Venture....586
Scandal and Infighting at AFLDS....587
Summary....588
14. Neo-Nazis and Their Chatrooms....589
How Antifascists Infiltrated Neo-Nazi Discord Servers....591
Analyzing Leaked Chat Logs....592
Making JSON Files Readable....593
Exploring Objects, Keys, and Values with jq....594
Converting Timestamps....601
Finding Usernames....602
The Discord History Tracker....604
A Script to Search the JSON Files....607
My Discord Analysis Code....613
Designing the SQL Database....614
Importing Chat Logs into the SQL Database....619
Building the Web Interface....627
Using Discord Analysis to Find Revelations....635
The Pony Power Discord Server....639
The Launch of DiscordLeaks....643
The Aftermath....644
The Lawsuit Against Unite the Right....645
The Patriot Front Chat Logs....646
Summary....647
Afterword....648
A. Solutions to Common WSL Problems....650
Understanding WSL’s Linux Filesystem....651
The Disk Performance Problem....654
Solving the Disk Performance Problem....655
Storing Only Active Datasets in Linux....655
Storing Your Linux Filesystem on a USB Disk....656
Next Steps....662
B. Scraping the Web....664
Legal Considerations....665
HTTP Requests....666
Scraping Techniques....667
Loading Pages with HTTPX....667
Parsing HTML with Beautiful Soup....674
Automating Web Browsers with Selenium....682
Next Steps....689
Index....691
Data-science investigations have brought journalism into the 21st century, and, guided by The Intercept’s infosec expert Micah Lee, this book is your blueprint for uncovering hidden secrets in hacked datasets.
Unlock the internet’s treasure trove of public interest data with Hacks, Leaks, and Revelations by Micah Lee, an investigative reporter and security engineer. This hands-on guide blends real-world techniques for researching large datasets with lessons on coding, data authentication, and digital security. All of this is spiced up with gripping stories from the front lines of investigative journalism.
Dive into exposed datasets from a wide array of sources: the FBI, the DHS, police intelligence agencies, extremist groups like the Oath Keepers, and even a Russian ransomware gang. Lee’s own in-depth case studies on disinformation-peddling pandemic profiteers and neo-Nazi chatrooms serve as blueprints for your research.
Gain practical skills in searching massive troves of data for keywords like “antifa” and pinpointing documents with newsworthy revelations. Get a crash course in Python to automate the analysis of millions of files.
We live in an age where hacking and whistleblowing can unearth secrets that alter history. Hacks, Leaks, and Revelations is your toolkit for uncovering new stories and hidden truths. Crack open your laptop, plug in a hard drive, and get ready to change history.