Pro Apache Hadoop - GBV
Pro Apache Hadoop
Second Edition
Sameer Wadkar
Madhu Siddalingaiah
Contents
About the Authors xix
About the Technical Reviewer xxi
Acknowledgments xxiii
Introduction xxv
Chapter 1: Motivation for Big Data 1
What Is Big Data? 1
Key Idea Behind Big Data Techniques 2
Data Is Distributed Across Several Nodes 2
Applications Are Moved to the Data 3
Data Is Processed Local to a Node 3
Sequential Reads Preferred Over Random Reads 3
An Example 4
Big Data Programming Models 4
Massively Parallel Processing (MPP) Database Systems 4
In-Memory Database Systems 5
MapReduce Systems 5
Bulk Synchronous Parallel (BSP) Systems 6
Big Data and Transactional Systems 7
How Much Can We Scale? 8
A Compute-Intensive Example 8
Amdahl's Law 9
Business Use-Cases for Big Data 9
Summary 10
Chapter 2: Hadoop Concepts 11
Introducing Hadoop 11
Introducing the MapReduce Model 12
Components of Hadoop 16
Hadoop Distributed File System (HDFS) 17
Secondary NameNode 22
TaskTracker 23
JobTracker 23
Hadoop 2.0 24
Components of YARN 26
HDFS High Availability 29
Summary 30
Chapter 3: Getting Started with the Hadoop Framework 31
Types of Installation 31
Stand-Alone Mode 31
Pseudo-Distributed Cluster 32
Multinode Cluster Installation 32
Preinstalled Using Amazon Elastic MapReduce 32
Setting up a Development Environment with a Cloudera Virtual Machine 33
Components of a MapReduce Program 34
Your First Hadoop Program 34
Prerequisites to Run Programs in Local Mode 35
WordCount Using the Old API 36
Building the Application 38
Running WordCount in Cluster Mode 39
WordCount Using the New API 39
Building the Application 41
Running WordCount in Cluster Mode 41
Third-Party Libraries in Hadoop Jobs 41
Summary 46
Chapter 4: Hadoop Administration 47
Hadoop Configuration Files 47
Configuring Hadoop Daemons 48
Precedence of Hadoop Configuration Files 49
Diving into Hadoop Configuration Files 49
core-site.xml 50
hdfs-*.xml 51
mapred-site.xml 52
yarn-site.xml 54
Memory Allocations in YARN 55
Scheduler 56
Capacity Scheduler 57
Fair Scheduler 59
Fair Scheduler Configuration 60
yarn-site.xml Configurations 61
Allocation File Format and Configurations 62
Determine Dominant Resource Share in drf Policy 63
Slaves File 64
Rack Awareness 64
Providing Hadoop with Network Topology 64
Cluster Administration Utilities 65
Check the HDFS 66
Command-Line HDFS Administration 68
Rebalancing HDFS Data 70
Copying Large Amounts of Data from the HDFS 71
Summary 72
Chapter 5: Basics of MapReduce Development 73
Hadoop and Data Processing 73
Reviewing the Airline Dataset 73
Preparing the Development Environment 75
Preparing the Hadoop System 75
MapReduce Programming Patterns 76
Map-Only Jobs (SELECT and WHERE Queries) 76
Problem Definition: SELECT Clause 76
Problem Definition: WHERE Clause 84
Map and Reduce Jobs (Aggregation Queries) 87
Problem Definition: GROUP BY and SUM Clauses 88
Improving Aggregation Performance Using the Combiner 94
Problem Definition: Optimized Aggregators 95
Role of the Partitioner 100
Problem Definition: Split Airline Data by Month 100
Bringing It All Together 103
Summary 106
Chapter 6: Advanced MapReduce Development 107
MapReduce Programming Patterns 107
Introduction to Hadoop I/O 107
Problem Definition: Sorting 109
Problem Definition: Analyzing Consecutive Records 124
Problem Definition: Join Using MapReduce 134
Problem Definition: Join Using Map-Only Jobs 140
Writing to Multiple Output Files in a Single MR Job 145
Collecting Statistics Using Counters 147
Summary 150
Chapter 7: Hadoop Input/Output 151
Compression Schemes 151
What Can Be Compressed? 152
Compression Schemes 152
Enabling Compression 153
Inside the Hadoop I/O Processes 154
InputFormat 155
OutputFormat 156
Custom OutputFormat: Conversion from Text to XML 157
Custom InputFormat: Consuming a Custom XML file 161
Hadoop Files 170
SequenceFile 171
MapFiles 176
Avro Files 177
Summary 183
Chapter 8: Testing Hadoop Programs 185
Revisiting the Word Counter 185
Introducing MRUnit 187
Installing MRUnit 187
MRUnit Core Classes 187
Writing an MRUnit Test Case 188
Testing Counters 190
Features of MRUnit 193
Limitations of MRUnit 194
Testing with LocalJobRunner 194
Limitations of LocalJobRunner 197
Testing with MiniMRCluster 197
Setting up the Development Environment 197
Example for MiniMRCluster 199
Limitations of MiniMRCluster 201
Testing MR Jobs That Access Network Resources 201
Summary 202
Chapter 9: Monitoring Hadoop 203
Writing Log Messages in Hadoop MapReduce Jobs 203
Viewing Log Messages in Hadoop MapReduce Jobs 206
User Log Management in Hadoop 2.x 209
Log Storage in Hadoop 2.x 209
Log Management Improvements 211
Viewing Logs Using Web-Based UI 211
Command-Line Interface 211
Log Retention 212
Hadoop Cluster Performance Monitoring 212
Using YARN REST APIs 213
Managing the Hadoop Cluster Using Vendor Tools 213
Ambari Architecture 214
Summary 215
Chapter 10: Data Warehousing Using Hadoop 217
Apache Hive 217
Installing Hive 218
Hive Architecture 218
Metastore 219
Compiler Basics 219
Hive Concepts 219
HiveQL Compiler Details 223
Data Definition Language 227
Data Manipulation Language 228
External Interfaces 229
Hive Scripts 231
Performance 232
MapReduce Integration 232
Creating Partitions 233
User-Defined Functions 234
Impala 236
Impala Architecture 237
Impala Features 237
Impala Limitations 237
Shark 238
Shark/Spark Architecture 238
Summary 239
Chapter 11: Data Processing Using Pig 241
An Introduction to Pig 241
Running Pig 243
Executing in the Grunt Shell 244
Executing a Pig Script 244
Embedded Java Program 245
Pig Latin 246
Comments in a Pig Script 246
Execution of Pig Statements 247
Pig Commands 247
User-Defined Functions 252
Eval Functions Invoked in the Mapper 253
Eval Functions Invoked in the Reducer 253
Writing and Using a Custom FilterFunc 260
Comparison of Pig versus Hive 262
Crunch API 263
How Crunch Differs from Pig 263
Sample Crunch Pipeline 264
Summary 269
Chapter 12: HCatalog and Hadoop in the Enterprise 271
HCatalog and Enterprise Data Warehouse Users 271
HCatalog: A Brief Technical Background 272
HCatalog Command-Line Interface 274
WebHCat 274
HCatalog Interface for MapReduce 275
HCatalog Interface for Pig 278
HCatalog Notification Interface 279
Security and Authorization in HCatalog 279
Bringing It All Together 280
Summary 281
Chapter 13: Log Analysis Using Hadoop 283
Log File Analysis Applications 283
Web Analytics 283
Security Compliance and Forensics 284
Monitoring and Alerts 284
Internet of Things 285
Analysis Steps 286
Load 286
Refine 286
Visualize 287
Apache Flume 287
Core Concepts 288
Netflix Suro 290
Cloud Solutions 291
Summary 291
Chapter 14: Building Real-Time Systems Using HBase 293
What Is HBase? 293
Typical HBase Use-Case Scenarios 294
HBase Data Model 295
HBase Logical or Client-Side View 295
Differences Between HBase and RDBMSs 296
HBase Tables 297
HBase Cells 297
HBase Column Family 297
HBase Commands and APIs 298
Getting a Command List: help Command 299
Creating a Table: create Command 300
Adding Rows to a Table: put Command 300
Retrieving Rows from the Table: get Command 300
Reading Multiple Rows: scan Command 300
Counting the Rows in the Table: count Command 301
Deleting Rows: delete Command 301
Truncating a Table: truncate Command 301
Dropping a Table: drop Command 302
Altering a Table: alter Command 302
HBase Architecture 302
HBase Components 303
Compaction and Splits in HBase 309
Compaction 310
HBase Configuration: An Overview 311
hbase-default.xml and hbase-site.xml 311
HBase Application Design 312
Tall vs. Wide vs. Narrow Table Design 312
Row Key Design 313
HBase Operations Using Java API 314
HBase Treats Everything as Bytes 314
Create an HBase Table 315
Administrative Functions Using HBaseAdmin 315
Accessing Data Using the Java API 316
HBase MapReduce Integration 320
A MapReduce Job to Read an HBase Table 320
HBase and MapReduce Clusters 323
Scenario I: Frequent MapReduce Jobs Against HBase Tables 323
Scenario II: HBase and MapReduce Have Independent SLAs 323
Summary 323
Chapter 15: Data Science with Hadoop 325
Hadoop Data Science Methods 325
Apache Hama 326
Bulk Synchronous Parallel Model 326
Hama Hello World! 327
Monte Carlo Methods 329
K-Means Clustering 333
Apache Spark 336
Resilient Distributed Datasets (RDDs) 336
Monte Carlo with Spark 337
KMeans with Spark 339
RHadoop 341
Summary 342
Chapter 16: Hadoop in the Cloud 343
Economics 343
Self-Hosted Cluster 343
Cloud-Hosted Cluster 344
Elasticity 344
On Demand 344
Bid Pricing 345
Hybrid Cloud 345
Logistics 345
Ingress/Egress 345
Data Retention 345
Security 346
Cloud Usage Models 346
Cloud Providers 347
Amazon Web Services 347
Google Cloud Platform 349
Microsoft Azure 350
Choosing a Cloud Vendor 350
Case Study: Amazon Web Services 351
Elastic MapReduce 351
Elastic Compute Cloud 354
Summary 356
Chapter 17: Building a YARN Application 357
YARN: A General-Purpose Distributed System 357
YARN: A Quick Review 359
Creating a YARN Application 361
POM Configuration 362
DownloadService.java Class 362
Client.java 365
Steps to Launch the Application Master from the Client 365
ApplicationMaster.java 373
Communication Protocol between Application Master and Resource Manager: Application Master Protocol 373
Node Manager Communication Protocol: Container Management Protocol 373
Steps to Launch the Worker Tasks 373
Executing the Application Master 378
Launch the Application in Un-Managed Mode 379
Launch the Application in Managed Mode 379
Summary 379
Appendix A: Installing Hadoop 381
Installing Hadoop 2.2.0 on Windows 381
Preparing the Installation Environment 381
Building Hadoop 2.2.0 for Windows 383
Installing Hadoop 2.2.0 for Windows 383
Configuring Hadoop 2.2.0 383
Preparing the Hadoop Cluster 386
Starting HDFS 387
Starting MapReduce (YARN) 387
Verifying that the Cluster Is Running 387
Testing the Cluster 387
Installing Hadoop 2.2.0 on Linux 388
Appendix B: Using Maven with Eclipse 391
A Quick Introduction to Maven 391
Creating a Maven Project 391
Using Maven with Eclipse 393
Installing the m2e Maven Eclipse Plug-in 393
Creating a Maven Project from Eclipse 393
Building a Maven Project from Eclipse 396
Appendix C: Apache Ambari 399
Hadoop Components Supported by Apache Ambari 399
Installing Apache Ambari 401
Trying the Ambari Sandbox on Your OS 401
Index 403