Essentials of Big Data Analytics: Applications in R and Python


Authors: Chavan Pallavi Vijay, Mangrulkar Ramchandra, Pampattiwar Kalyani

Release date: 2026

Publisher: Elsevier Inc.

Number of pages: 342

File size: 6.4 MB

File type: PDF

Added by: codelibs

Front Cover....1

Essentials of Big Data Analytics....4

Contents....8

Preface....12

Acknowledgments....14

Introduction....16

1 Introduction to big data analytics....18

1.1 Understanding big data....18

1.1.1 Definition and characteristics of big data....18

1.1.2 Volume, velocity, variety, veracity, and value (5Vs)....19

1.1.3 Real-time processing challenges....23

1.2 Types of big data....25

1.2.1 Classifying data into structured, unstructured, and semi-structured types....25

1.2.2 Examples of each type in various industries....29

1.3 Significance and applications of big data analytics....31

1.3.1 Discussing the importance of deriving insights from big data....32

1.3.2 Applications in business, healthcare, finance, and more....32

1.3.3 Impact on decision-making and strategic planning....35

1.4 Basics of data science....36

1.4.1 Core principles and goals of data science....36

1.4.2 The data science lifecycle....37

1.4.3 Role of a data scientist....39

1.4.4 Big data and data science: a symbiotic connection....40

Exercise....41

References....42

2 Mathematical foundations....44

2.1 Statistical concepts for big data....44

2.1.1 Review of statistical fundamentals....44

2.1.2 Adaptations for handling large datasets....66

2.1.3 Significance testing and confidence intervals in big data....66

2.1.3.1 Significance testing in big data....67

2.1.3.2 Confidence intervals in big data....68

2.2 R and Python fundamentals....69

2.2.1 Basic syntax, data types, structures....70

Variables and assignments....70

Basic arithmetic operations....71

Vectorized operations (efficient computations)....71

String operations....72

Functions and basic operations in R and Python....72

Lambda functions (anonymous functions)....72

Handling missing values....73

2.2.2 Data frames, lists, matrices, vectors, and arrays....73

Data frames in R and Python....73

Lists in R and Python....74

Vectors in R and Python....75

Matrices and arrays....76

Handling large data frames in R and Python....78

2.3 Data exploration and visualization....79

2.3.1 Exploratory data analysis (EDA) with R and Python....80

2.3.2 Visualizing data using ggplot2, Matplotlib, Seaborn, and Plotly....87

ggplot2 (R)....87

Matplotlib (Python)....87

Seaborn (Python)....87

Plotly (Python & R)....88

2.3.3 Interpretation of visualizations....88

Exercise....88

References....89

3 Big data technologies and programming....90

3.1 Overview of big data technologies (Hadoop, Spark)....90

3.1.1 Introduction to Hadoop framework....90

Hadoop: distributed storage and processing....90

Hadoop Distributed File System (HDFS)....91

Hadoop components....96

3.1.2 Introduction to Spark framework....98

Advantages of Spark over Hadoop MapReduce....100

3.1.3 Use cases for each technology....100

Industry use cases for Hadoop framework....100

Industry use cases for Spark framework....101

3.2 Introduction to MapReduce....102

3.2.1 Key concepts: Map phase, Shuffle and Sort, Reduce phase....102

3.2.2 MapReduce programming model....103

Example: multiply two 2x2 matrices using MapReduce....106

3.3 R and Python as programming languages for big data....107

3.3.1 Capabilities for handling large datasets....107

3.3.2 Integrating R and Python with big data tools....108

3.4 Using Python with Hadoop streaming for word count....108

3.5 Integrating R and Python with distributed computing....110

3.5.1 Challenges of distributed R and Python computing....114

Exercise....115

References....115

4 Data ingestion and preprocessing....116

4.1 Data collection strategies....116

4.1.1 Strategies for collecting diverse data sources....116

4.1.2 Challenges in data collection and solutions....120

4.2 Data cleaning and preprocessing....121

4.2.1 Techniques for cleaning noisy or inconsistent data....121

4.2.2 Techniques for preprocessing data....133

4.2.3 Feature engineering....139

4.2.4 Feature transformation (dimensionality reduction)....145

Exercise....154

References....154

5 Big data storage and management....156

5.1 Storage architectures for big data....156

5.1.1 Overview of storage solutions like HDFS and distributed databases....156

Comparison of storage systems in big data....161

5.1.2 Choosing storage solutions based on use cases....162

Decision factors in choosing a big data storage solution....162

Use case examples....164

Use case 1: transactional data processing....164

Characteristics of transactional data processing....164

Storage solutions for transactional data processing....164

Deciding factors for choosing the right storage solution....165

Use case 2: analytical processing....165

Characteristics of analytical processing....165

Storage solutions for analytical processing....165

Deciding factors for choosing the right storage solution....166

Use case 3: data archival....166

Characteristics of data archival....166

Storage solutions for data archival....167

Deciding factors for choosing the right storage solution....167

5.2 Scalable data management....167

5.2.1 Scalability challenges and solutions....168

Solutions and future directions....169

5.2.2 Horizontal and vertical scaling concepts....169

Horizontal scaling in big data....169

Vertical scaling in big data....170

Hybrid scaling....171

Practical examples....171

5.3 Data warehousing and data lakes....172

5.3.1 Understanding data warehousing and data lakes....172

Data warehouses....172

Data warehouse characteristics....173

Why use a data warehouse?....173

Data lakes....173

Data lake characteristics....173

Why use a data lake?....174

HDFS as data lakes....174

HDFS as the foundation of data lakes....174

Key characteristics of HDFS in data lakes....174

Schema-on-read model in HDFS....175

Real-world example: Netflix and HDFS....175

Differences between data warehouses and data lakes....175

5.3.2 Integrating R and Python in analytics on data lakes....176

Using R and Python for data manipulation....176

Python libraries for big data analytics....176

R libraries for big data analytics....176

Connecting R/Python to HDFS and data lakes for data manipulation, statistical analysis, and visualization....177

Python code – connecting to data lakes (PyArrow & HDFSClient)....177

R code – connecting to data lakes (sparklyr and rhdfs)....178

Data manipulation in data lakes with R and Python....179

Machine learning in data lakes....179

Python for machine learning in data lakes....179

R for machine learning in data lakes....182

Data science with big data: use cases....185

Analytics on streaming data: real-time analytics with R/Python....186

Real-time analytics with Python....186

Real-time analytics with R....186

5.4 Case studies: practical implementations using Python....186

5.4.1 Retail company data lake case study....186

5.4.2 Financial institution data warehouse case study....189

Exercise....193

References....193

6 Advanced MapReduce for big data processing....194

6.1 Understanding MapReduce paradigm....194

6.1.1 Deep dive into the MapReduce framework....194

6.1.2 Practical use cases for MapReduce....200

6.1.2.1 Healthcare analytics....200

6.1.2.2 Financial risk assessment....201

6.2 Implementing MapReduce jobs....202

6.2.1 Step-by-step guide on writing and executing a MapReduce job....202

6.2.2 Common patterns and anti-patterns in MapReduce development....205

6.2.3 Anti-patterns in MapReduce development....208

6.3 MapReduce optimization techniques....209

6.3.1 Strategies for optimizing MapReduce jobs....209

6.3.2 Combiners, partitioning, and compression techniques....210

Exercise....211

References....211

7 Machine learning techniques for big data processing....212

7.1 Introduction to machine learning in big data context....212

7.1.1 What is machine learning?....212

Formal definition....212

Learning paradigms in machine learning....212

Big data relevance....213

7.1.2 Role of machine learning in big data analytics....213

Case study: predictive maintenance using supervised learning....214

Dataset description....214

Steps to download kaggle.json from Kaggle....215

7.1.3 Machine learning vs traditional statistical approaches....218

7.1.4 The machine learning pipeline....218

Benefits of machine learning pipeline....218

Steps to build a machine learning pipeline....219

Implementation for model training....219

7.1.5 Challenges in applying ML to big data....221

7.2 Supervised learning for big data....221

7.2.1 Overview of supervised learning....221

How supervised learning works?....222

Types of supervised learning in machine learning....222

7.2.2 Regression techniques....222

7.2.3 Classification techniques....225

7.2.4 Model evaluation and metrics....230

7.2.5 Applications in finance, healthcare, and risk management....233

7.2.6 Scalable implementations using Spark MLlib / TensorFlow....235

7.3 Unsupervised learning for big data....240

7.3.1 Introduction to unsupervised learning....240

7.3.2 Clustering techniques....240

7.3.3 Dimensionality reduction....245

What is dimensionality?....245

Why dimensionality reduction?....246

Problems in high-dimensional spaces....246

What is dimensionality reduction?....246

Two main approaches....246

Variants of autoencoders useful in big data....248

7.4 Optimization techniques in big data processing....250

7.4.1 Introduction to optimization in big data....250

Role in scalable analytics....251

Optimization in big data pipelines and ML workflows....251

7.4.2 Types of optimization techniques....251

7.4.3 Linear programming (LP)....252

Applications of LP in big data....252

Limitations of traditional LP solvers....254

Scaling LP for big data....254

7.4.4 Dynamic programming (DP)....256

Problem formulation: dynamic programming....256

Key properties of DP problems....256

Generic DP formulation....257

Example: Fibonacci recurrence....257

Big data perspective....257

Scaling dynamic programming for big data: a real-world perspective....258

Output....260

Replication distribution visualization....260

7.4.5 Goal programming (GP)....261

Mathematical formulation....261

Example: multi-objective scheduling in a big data cluster....261

Mathematical formulation....262

Scalable goal programming in big data....264

How to achieve scalability in goal programming....265

Exercise....267

References....268

8 Mining data streams....270

8.1 The stream data model....270

8.1.1 A data-stream-management system....270

Architecture of DSMS....271

8.1.2 Examples of stream sources, stream queries....271

Stream queries....272

Issues in data stream query processing....274

8.2 Sampling and filtering in data streams....274

8.2.1 Sampling data in streams....274

Varying the sample size....275

8.2.2 Filtering in data streams....275

Types of filtering....275

8.3 Algorithms for approximate data stream processing....276

8.3.1 Counting distinct elements in a stream....277

The Flajolet-Martin algorithm....277

8.3.2 Counting ones in a window....279

The Datar-Gionis-Indyk-Motwani (DGIM) algorithm....279

8.3.3 Bloom filters and their analysis....282

Probability of false positivity....284

Size of bit array....284

Space efficiency....284

Choice of hash function....285

Exercise....285

References....286

9 Case studies and practical applications....288

9.1 Industry-specific use cases....288

9.1.1 Applications in manufacturing, transportation and retail....288

9.1.1.1 Case study: GE predictive maintenance in aviation....288

Logistic regression for predictive maintenance....289

9.1.1.2 Case study: UPS Orion project....291

9.1.1.3 Case study: Walmart’s real-time replenishment system....294

9.2 Success stories in big data analytics....296

9.2.1 Vodafone – enhancing customer retention through unified analytics....297

9.2.2 CS Energy – smart grid modernization using big data analytics....298

9.3 Practical implementations and challenges....300

9.3.1 Implementing solutions using R and Python....301

9.3.2 Addressing real-world challenges....304

Exercise....304

References....304

10 Hands-on exercises and tutorials with R, Python and MapReduce....306

10.1 Coding examples in R, Python, and MapReduce....306

10.1.1 Handling and analyzing large sales data with R....306

10.1.1.1 Importing and manipulating large sales data in R using data.table....306

10.1.1.2 Data transformation and aggregation in R with dplyr....307

10.1.1.3 Machine learning with xgboost for sales prediction....307

10.1.2 Handling and analyzing large sales data with Python....307

10.1.2.1 Importing and handling large sales data with dask....307

10.1.2.2 Parallelizing machine learning with joblib....308

10.1.2.3 Visualizing sales trends with plotly....308

10.1.3 Total sales by product category using MapReduce....309

Processing sales streaming data with MapReduce....309

10.2 End-to-end tutorials for implementing big data solutions....310

10.2.1 Case study: healthcare data for disease prediction....310

10.3 Debugging and optimization strategies....314

10.3.1 Debugging strategies for big data workflows....314

10.3.2 Optimizing data processing at scale....314

10.3.3 Optimizing model training and evaluation for big data....315

10.3.4 Deployment and monitoring optimization for big data solutions....316

Exercise....316

References....317

11 Emerging trends and future directions....318

11.1 AI, Edge computing, and IoT integration....318

11.1.1 Introduction to the integration of AI, Edge, and IoT in big data....318

11.1.2 Role of AI in enhancing data-driven intelligence....319

11.1.3 Edge computing for low-latency, local data processing....319

11.1.4 IoT as a generator of continuous, real-time data streams....320

11.1.5 Real-world integration: smart cities, autonomous systems, and predictive maintenance....320

11.1.6 Edge-cloud collaboration for scalable, distributed analytics and associated challenges....321

11.2 Real-time analytics with cloud computing....321

11.2.1 Definition and need for real-time analytics in modern enterprises....321

11.2.2 Cloud as an enabler: scalability, elasticity, and on-demand compute power....321

11.2.3 Stream processing frameworks....322

11.2.4 Use cases in real-time analytics....322

11.3 Future research directions in big data....324

11.3.1 Quantum computing....324

11.3.2 Ethical data analytics....324

11.3.3 Privacy-preserving technologies....325

11.3.4 Open research questions and emerging domains in big data....325

Exercise....326

References....326

Nomenclature....328

Glossary....330

Features of the book....332

Index....334

Back Cover....342

Essentials of Big Data Analytics: Applications in R and Python is a guide to big data analytics that blends theoretical concepts with hands-on practice in the Python and R programming languages and the MapReduce framework. It bridges the gap between theory and practical implementation, providing a clear, practical understanding of the key principles and techniques needed to work with big data.

The book is designed as a resource for readers looking to deepen their understanding of big data analytics, particularly in a computer science, engineering, or data science context. It emphasizes hands-on learning through exercises and tutorials in R and Python. Given the growing role of big data in industry and scientific research, it aims to equip professionals with the skills needed to thrive in data-driven environments.

Key features

Includes hands-on tutorials and case studies: structured exercises and real-world examples reinforce learning and skill-building.
Focuses on Python and R for big data: detailed lessons in Python and R programming cater to the increasing demand for data science expertise.
Balances theory and practice: comprehensive coverage ensures a strong theoretical foundation paired with actionable insights for real-world application.

Readership

Computer science, data science, and data analysis researchers in academia and industry. The primary audience also includes researchers and professionals in mathematics, AI, ML, and deep learning, as well as those who want to enhance their skills in data mining and analysis.
