Front Cover
Essentials of Big Data Analytics
Copyright
Contents
Preface
Acknowledgments
Introduction
1 Introduction to big data analytics
1.1 Understanding big data
1.1.1 Definition and characteristics of big data
1.1.2 Volume, velocity, variety, veracity, and value (5Vs)
1.1.3 Real-time processing challenges
1.2 Types of big data
1.2.1 Classifying data into structured, unstructured, and semi-structured types
1.2.2 Examples of each type in various industries
1.3 Significance and applications of big data analytics
1.3.1 Discussing the importance of deriving insights from big data
1.3.2 Applications in business, healthcare, finance, and more
1.3.3 Impact on decision-making and strategic planning
1.4 Basics of data science
1.4.1 Core principles and goals of data science
1.4.2 The data science lifecycle
1.4.3 Role of a data scientist
1.4.4 Big data and data science: a symbiotic connection
Exercise
References
2 Mathematical foundations
2.1 Statistical concepts for big data
2.1.1 Review of statistical fundamentals
2.1.2 Adaptations for handling large datasets
2.1.3 Significance testing and confidence intervals in big data
2.1.3.1 Significance testing in big data
2.1.3.2 Confidence intervals in big data
2.2 R and Python fundamentals
2.2.1 Basic syntax, data types, structures
Variables and assignments
Basic arithmetic operations
Vectorized operations (efficient computations)
String operations
Functions and basic operations in R and Python
Lambda functions (anonymous functions)
Handling missing values
2.2.2 Data frames, lists, matrices, vectors, and arrays
Data frames in R and Python
Lists in R and Python
Vectors in R and Python
Matrices and arrays
Handling large data frames in R and Python
2.3 Data exploration and visualization
2.3.1 Exploratory data analysis (EDA) with R and Python
2.3.2 Visualizing data using ggplot2, Matplotlib, Seaborn, and Plotly
ggplot2 (R)
Matplotlib (Python)
Seaborn (Python)
Plotly (Python & R)
2.3.3 Interpretation of visualizations
Exercise
References
3 Big data technologies and programming
3.1 Overview of big data technologies (Hadoop, Spark)
3.1.1 Introduction to Hadoop framework
Hadoop: distributed storage and processing
Hadoop Distributed File System (HDFS)
Hadoop components
3.1.2 Introduction to Spark framework
Advantages of Spark over Hadoop MapReduce
3.1.3 Use cases for each technology
Industry use cases for Hadoop framework
Industry use cases for Spark framework
3.2 Introduction to MapReduce
3.2.1 Key concepts: Map phase, Shuffle and Sort, Reduce phase
3.2.2 MapReduce programming model
Example: multiply two 2x2 matrices using MapReduce
3.3 R and Python as programming languages for big data
3.3.1 Capabilities for handling large datasets
3.3.2 Integrating R and Python with big data tools
3.4 Using Python with Hadoop streaming for word count
3.5 Integrating R and Python with distributed computing
3.5.1 Challenges of distributed R and Python computing
Exercise
References
4 Data ingestion and preprocessing
4.1 Data collection strategies
4.1.1 Strategies for collecting diverse data sources
4.1.2 Challenges in data collection and solutions
4.2 Data cleaning and preprocessing
4.2.1 Techniques for cleaning noisy or inconsistent data
4.2.2 Techniques for preprocessing data
4.2.3 Feature engineering
4.2.4 Feature transformation (dimensionality reduction)
Exercise
References
5 Big data storage and management
5.1 Storage architectures for big data
5.1.1 Overview of storage solutions like HDFS and distributed databases
Comparison of storage systems in big data
5.1.2 Choosing storage solutions based on use cases
Decision factors in choosing a big data storage solution
Use case examples
Use case 1: transactional data processing
Characteristics of transactional data processing
Storage solutions for transactional data processing
Deciding factors for choosing the right storage solution
Use case 2: analytical processing
Characteristics of analytical processing
Storage solutions for analytical processing
Deciding factors for choosing the right storage solution
Use case 3: data archival
Characteristics of data archival
Storage solutions for data archival
Deciding factors for choosing the right storage solution
5.2 Scalable data management
5.2.1 Scalability challenges and solutions
Solutions and future directions
5.2.2 Horizontal and vertical scaling concepts
Horizontal scaling in big data
Vertical scaling in big data
Hybrid scaling
Practical examples
5.3 Data warehousing and data lakes
5.3.1 Understanding data warehousing and data lakes
Data warehouses
Data warehouse characteristics
Why use a data warehouse?
Data lakes
Data lake characteristics
Why use a data lake?
HDFS as data lakes
HDFS as the foundation of data lakes
Key characteristics of HDFS in data lakes
Schema-on-read model in HDFS
Real-world example: Netflix and HDFS
Differences between data warehouses and data lakes
5.3.2 Integrating R and Python in analytics on data lakes
Using R and Python for data manipulation
Python libraries for big data analytics
R libraries for big data analytics
Connecting R/Python to HDFS and data lakes for data manipulation, statistical analysis, and visualization
Python code – connecting to data lakes (PyArrow & HDFSClient)
R code – connecting to data lakes (sparklyr and rhdfs)
Data manipulation in data lakes with R and Python
Machine learning in data lakes
Python for machine learning in data lakes
R for machine learning in data lakes
Data science with big data: use cases
Analytics on streaming data: real-time analytics with R/Python
Real-time analytics with Python
Real-time analytics with R
5.4 Case studies: practical implementations using Python
5.4.1 Retail company data lake case study
5.4.2 Financial institution data warehouse case study
Exercise
References
6 Advanced MapReduce for big data processing
6.1 Understanding MapReduce paradigm
6.1.1 Deep dive into the MapReduce framework
6.1.2 Practical use cases for MapReduce
6.1.2.1 Healthcare analytics
6.1.2.2 Financial risk assessment
6.2 Implementing MapReduce jobs
6.2.1 Step-by-step guide on writing and executing a MapReduce job
6.2.2 Common patterns and anti-patterns in MapReduce development
6.2.3 Anti-patterns in MapReduce development
6.3 MapReduce optimization techniques
6.3.1 Strategies for optimizing MapReduce jobs
6.3.2 Combiners, partitioning, and compression techniques
Exercise
References
7 Machine learning techniques for big data processing
7.1 Introduction to machine learning in big data context
7.1.1 What is machine learning?
Formal definition
Learning paradigms in machine learning
Big data relevance
7.1.2 Role of machine learning in big data analytics
Case study: predictive maintenance using supervised learning
Dataset description
Steps to download kaggle.json from Kaggle
7.1.3 Machine learning vs traditional statistical approaches
7.1.4 The machine learning pipeline
Benefits of machine learning pipeline
Steps to build a machine learning pipeline
Implementation for model training
7.1.5 Challenges in applying ML to big data
7.2 Supervised learning for big data
7.2.1 Overview of supervised learning
How supervised learning works?
Types of supervised learning in machine learning
7.2.2 Regression techniques
7.2.3 Classification techniques
7.2.4 Model evaluation and metrics
7.2.5 Applications in finance, healthcare, and risk management
7.2.6 Scalable implementations using Spark MLlib / TensorFlow
7.3 Unsupervised learning for big data
7.3.1 Introduction to unsupervised learning
7.3.2 Clustering techniques
7.3.3 Dimensionality reduction
What is dimensionality?
Why dimensionality reduction?
Problems in high-dimensional spaces
What is dimensionality reduction?
Two main approaches
Variants of autoencoders useful in big data
7.4 Optimization techniques in big data processing
7.4.1 Introduction to optimization in big data
Role in scalable analytics
Optimization in big data pipelines and ML workflows
7.4.2 Types of optimization techniques
7.4.3 Linear programming (LP)
Applications of LP in big data
Limitations of traditional LP solvers
Scaling LP for big data
7.4.4 Dynamic programming (DP)
Problem formulation: dynamic programming
Key properties of DP problems
Generic DP formulation
Example: Fibonacci recurrence
Big data perspective
Scaling dynamic programming for big data: a real-world perspective
Output
Replication distribution visualization
7.4.5 Goal programming (GP)
Mathematical formulation
Example: multi-objective scheduling in a big data cluster
Mathematical formulation
Scalable goal programming in big data
How to achieve scalability in goal programming
Exercise
References
8 Mining data streams
8.1 The stream data model
8.1.1 A data-stream-management system
Architecture of DSMS
8.1.2 Examples of stream sources, stream queries
Stream queries
Issues in data stream query processing
8.2 Sampling and filtering in data streams
8.2.1 Sampling data in streams
Varying the sample size
8.2.2 Filtering in data streams
Types of filtering
8.3 Algorithms for approximate data stream processing
8.3.1 Counting distinct elements in a stream
The Flajolet Martin algorithm
8.3.2 Counting ones in a window
The Datar-Gionis-Indyk-Motwani (DGIM) algorithm
8.3.3 Bloom filters and their analysis
Probability of false positivity
Size of bit array
Space efficiency
Choice of Hash function
Exercise
References
9 Case studies and practical applications
9.1 Industry-specific use cases
9.1.1 Applications in manufacturing, transportation and retail
9.1.1.1 Case study: GE predictive maintenance in aviation
Logistic regression for predictive maintenance
9.1.1.2 Case study: UPS Orion project
9.1.1.3 Case study: Walmart’s real-time replenishment system
9.2 Success stories in big data analytics
9.2.1 Vodafone – enhancing customer retention through unified analytics
9.2.2 CS energy – smart grid modernization using big data analytics
9.3 Practical implementations and challenges
9.3.1 Implementing solutions using R and Python
9.3.2 Addressing real-world challenges
Exercise
References
10 Hands-on exercises and tutorials with R, Python and MapReduce
10.1 Coding examples in R, Python, and MapReduce
10.1.1 Handling and analyzing large sales data with R
10.1.1.1 Importing and manipulating large sales data in R using data.table
10.1.1.2 Data transformation and aggregation in R with dplyr
10.1.1.3 Machine learning with xgboost for sales prediction
10.1.2 Handling and analyzing large sales data with Python
10.1.2.1 Importing and handling large sales data with dask
10.1.2.2 Parallelizing machine learning with joblib
10.1.2.3 Visualizing sales trends with plotly
10.1.3 Total sales by product category using MapReduce
Processing sales streaming data with MapReduce
10.2 End-to-end tutorials for implementing big data solutions
10.2.1 Case study: healthcare data for disease prediction
10.3 Debugging and optimization strategies
10.3.1 Debugging strategies for big data workflows
10.3.2 Optimizing data processing at scale
10.3.3 Optimizing model training and evaluation for big data
10.3.4 Deployment and monitoring optimization for big data solutions
Exercise
References
11 Emerging trends and future directions
11.1 AI, Edge computing, and IoT integration
11.1.1 Introduction to the integration of AI, Edge, and IoT in big data
11.1.2 Role of AI in enhancing data-driven intelligence
11.1.3 Edge computing for low-latency, local data processing
11.1.4 IoT as a generator of continuous, real-time data streams
11.1.5 Real-world integration: smart cities, autonomous systems, and predictive maintenance
11.1.6 Edge-cloud collaboration for scalable, distributed analytics and associated challenges
11.2 Real-time analytics with cloud computing
11.2.1 Definition and need for real-time analytics in modern enterprises
11.2.2 Cloud as an enabler: scalability, elasticity, and on-demand compute power
11.2.3 Stream processing frameworks
11.2.4 Use cases in real-time analytics
11.3 Future research directions in big data
11.3.1 Quantum computing
11.3.2 Ethical data analytics
11.3.3 Privacy-preserving technologies
11.3.4 Open research questions and emerging domains in big data
Exercise
References
Nomenclature
Glossary
Features of the book
Index
Back Cover
Essentials of Big Data Analytics: Applications in R and Python is a comprehensive guide that demystifies big data analytics, blending theoretical concepts with hands-on practice in the Python and R programming languages and the MapReduce framework. The book bridges the gap between theory and implementation, giving readers a clear, practical understanding of the key principles and techniques needed to work with big data.

The book is designed as a resource for readers who want to deepen their understanding of big data analytics, particularly in computer science, engineering, and data science contexts. It emphasizes hands-on learning through exercises and tutorials in R and Python. Given the growing role of big data in industry and scientific research, it serves as a timely resource that equips professionals with the skills needed to thrive in data-driven environments.
The primary audience is computer science, data science, and data analysis researchers in academia and industry. It also includes researchers and professionals in mathematics, AI, ML, and deep learning, as well as anyone who wants to enhance their skills in data mining and analysis.