Big Data Management and Analytics

Authors: Brij B Gupta, Mamta
Publication date: 2024
Publisher: World Scientific Publishing Co Pte Ltd
Number of pages: 288
File size: 2.6 MB
File type: PDF

Contents....20

Foreword....8

Preface....12

About the Authors....16

Acknowledgments....18

List of Figures....26

List of Tables....30

Chapter 1 Introduction to Big Data....32

1.1 Data: The New Oil and the New Soil....33

1.2 What is Big Data and What are its Sources....34

1.2.1 Big Data Generated by Machines....35

1.2.2 Big Data Generated by Humans....36

1.2.3 Big Data Generated by Organizations....39

1.3 Characteristics of Big Data....42

1.3.1 Volume....43

1.3.2 Velocity....44

1.3.3 Variety....45

1.3.4 Veracity....46

1.3.5 Valence....46

1.3.6 Value....47

1.4 Importance of Big Data: Popular Use Cases....47

1.5 Chapter Summary....49

References....50

Chapter 2 Big Data Management and Modeling....52

2.1 Big Data Management....52

2.1.1 Data Acquisition/Ingestion....52

2.1.2 Data Storage....53

2.1.3 Data Quality....54

2.1.4 Data Operations....54

2.1.5 Data Scalability....54

2.1.6 Data Security....55

2.2 Challenges in Big Data Management: Case Study....55

2.3 Big Data Modeling....56

2.3.1 Data Model Structures....56

2.3.2 Data Model Operations....58

2.3.3 Data Model Constraints....61

2.4 Types of Data Models....61

2.4.1 Relational Data Model....62

2.4.2 Semi-Structured Data Model....63

2.4.3 Unstructured Data Model: Vector Space Data Model....65

2.4.4 Graph Data Model....69

2.5 Chapter Summary....71

References....72

Chapter 3 Big Data Processing....74

3.1 Requirements for Big Data Processing....74

3.2 Big Data Retrieval....76

3.2.1 Relational Data Query....76

3.2.2 JSON Data Query Using MongoDB and Aerospike....80

3.3 Big Data Integration....84

3.3.1 Big Data Integration Problems....87

3.4 Big Data Processing Pipeline....89

3.4.1 Data Transformation Operations in Big Data Processing Pipeline....90

3.4.1.1 Map and Reduce Operations....90

3.4.1.2 Aggregation Operations....91

3.4.1.3 Analytical Operations....92

3.5 Big Data Management and Processing Using Splunk and Datameer....94

3.5.1 Splunk....94

3.5.2 Datameer....95

3.6 Chapter Summary....95

References....96

Chapter 4 Big Data Analytics and Machine Learning....100

4.1 Introduction to Machine Learning....100

4.1.1 Machine Learning Techniques....102

4.2 Machine Learning Process....103

4.2.1 Acquire....106

4.2.2 Prepare....106

4.2.2.1 Exploratory Data Analysis (EDA)....106

4.2.2.1.1 Summary Statistics....107

4.2.2.1.2 Visualization Methods....111

4.2.2.2 Pre-Processing....118

4.2.2.2.1 Data Cleaning....119

4.2.2.2.1.1 Addressing Data Quality Issues....121

4.2.2.2.2 Feature Selection/Engineering....123

4.2.2.2.3 Feature Transformation....125

4.2.3 Analyze....130

4.2.3.1 Classification....130

4.2.3.1.1 Building and Applying a Classification Model....132

4.2.3.1.2 Classification Algorithms....133

4.2.4 Evaluation of Machine Learning Models....137

4.2.4.1 Evaluation Metrics....139

4.3 Scaling Up Machine Learning Algorithms....142

4.4 Chapter Summary....143

References....143

Chapter 5 Big Data Analytics Through Visualization....146

5.1 Graph Definition....146

5.1.1 Examples of Graph Analytics for Big Data....148

5.1.1.1 Social Media....148

5.1.1.2 Biological Networks....149

5.1.1.3 Personal Information Networks....151

5.2 Graph Analytics from the Perspective of Big Data....153

5.3 Techniques for Graph Analytics....155

5.3.1 Basic Definitions....155

5.3.2 Path Analytics....158

5.3.3 Connectivity Analytics....162

5.3.4 Community Analytics....163

5.3.5 Centrality Analytics....167

5.4 Large-Scale Graph Processing....167

5.4.1 Parallel Programming Model for Graphs....168

5.5 Chapter Summary....169

References....170

Chapter 6 Taming Big Data with Spark 2.0....174

6.1 Introduction to Spark 2.0....174

6.1.1 Why Spark 2.0 Replaced Hadoop....175

6.2 Resilient Distributed Datasets....176

6.3 Spark 2.0....178

6.3.1 Language Processing with Spark 2.0....185

6.3.2 Analysis of Streaming Data with Spark 2.0....186

6.3.3 Streaming API....187

6.3.4 Kafka....187

6.3.4.1 Kafka Streaming....188

6.3.5 Apache Spark Streaming....189

6.4 Spark Machine Learning Library....189

6.5 Chapter Summary....190

References....191

Chapter 7 Managing Big Data in Cloud Storage....194

7.1 Large-Scale Data Storage....195

7.1.1 Challenges of Storing Large Data in Distributed Systems....196

7.2 Hadoop Distributed File System (HDFS)....198

7.2.1 HDFS Permission Checks....198

7.2.2 HDFS Shell Commands....199

7.2.3 Chaining and Scripting HDFS Commands....212

7.2.4 Loading Data on HDFS....213

7.3 Hadoop User Experience (HUE)....215

7.3.1 Features of HUE....216

7.3.2 HUE Components....216

7.4 Chapter Summary....220

References....220

Chapter 8 Big Data in Healthcare....222

8.1 Digitalization in Healthcare Sector....223

8.1.1 Use of Big Data in Medical Care....224

8.2 Big Data in Public Health....224

8.2.1 Big Data Surveillance Using Machine Learning....224

8.2.2 Big Data in Public Health Training....225

8.2.3 Limitations and Open Issues for Big Data While Using Machine Learning in Public Health....227

8.3 The Four V’s of Big Data in Healthcare....228

8.4 Big Data in Genomics....231

8.5 Architectural Framework....233

8.5.1 Methodology of Big Data Analytics in Healthcare....235

8.5.2 Advantages of Big Data Analytics to Healthcare....236

8.5.3 Challenges of Big Data in Healthcare....237

8.6 Chapter Summary....238

References....238

Chapter 9 Big Data in Finance....242

9.1 Digitalization in Financial Industry....242

9.2 Sources of Financial Data....244

9.3 Challenges of Using Big Data in Financial Research....246

9.4 Financial Big Data....247

9.4.1 FBD Management....247

9.4.2 FBD Analytics....248

9.5 Theoretical Framework of Big Data in Financial Services....250

9.6 Popular Use Cases of FBD Analytics....250

9.7 Chapter Summary....253

References....254

Chapter 10 Enabling Tools and Technologies for Big Data Analytics....256

10.1 Big Data Management and Modeling Tools....257

10.1.1 Data Modeling Tools....258

10.1.2 Vector Data Model with Lucene....258

10.1.3 Graph Data Model with Gephi....259

10.1.4 Data Management Tools....260

10.1.4.1 Redis....261

10.1.4.2 Aerospike....261

10.1.4.3 AsterixDB....262

10.1.4.4 Solr....263

10.1.4.5 Vertica....263

10.2 Big Data Integration and Processing Tools....264

10.2.1 Big Data Processing Using Splunk and Datameer....265

10.3 Big Data Machine Learning Tools....267

10.3.1 KNIME....267

10.3.1.1 Exploring Data with KNIME Plots....269

10.3.1.2 Handling Missing Values in KNIME....272

10.3.1.3 Classification Using Decision Tree in KNIME....274

10.3.1.4 Evaluation of Decision Tree in KNIME....277

10.3.2 Spark MLlib....278

10.4 Big Data Graph Analytics Tools....278

10.4.1 Giraph....278

10.4.2 GraphX....279

10.4.3 Neo4j....280

10.5 Chapter Summary....281

References....282

Index....286

 With the proliferation of information, big data management and analysis have become an indispensable part of any system that handles large amounts of data. The amount of data generated by the multitude of interconnected devices is increasing exponentially, making the storage and processing of this data a real challenge.

 Big data management and analytics have gained momentum in almost every industry, from finance to healthcare. Big data can reveal key insights if handled and analyzed properly, and it has great potential to improve how any industry operates. This book covers the full spectrum of big data, from the preliminary level to specific case studies. It will help readers gain knowledge of the big data landscape.

 Highlights of the topics covered include a description of the big data ecosystem; real-world instances of big data issues; how the Vs of big data (volume, velocity, variety, veracity, valence, and value) affect data collection, monitoring, storage, analysis, and reporting; a structured process for getting value out of big data; and the differences between a standard database management system and a big data management system.

 Readers will gain insights into the choice of data models; data extraction and data integration for solving large data problems; data modeling using machine learning techniques; Spark's scalable machine learning techniques; modeling a big data problem as a graph database and performing scalable analytical operations over the graph; and different tools and techniques for processing big data, along with its applications, including in healthcare and finance.

