Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
CHAPTER 25
Big Data Technologies Based on MapReduce and Hadoop
Introduction
- Phenomenal growth in data generation
  - Social media
  - Sensors
  - Communications networks and satellite imagery
  - User-specific business data
- “Big data” refers to massive amounts of data
  - Exceeds the typical reach of a DBMS
- Big data analytics
25.1 What is Big Data?
- Big data ranges from terabytes (10^12 bytes) or petabytes (10^15 bytes) to exabytes (10^18 bytes)
- Volume
  - Refers to size of data managed by the system
- Velocity
  - Speed of data creation, ingestion, and processing
- Variety
  - Refers to type of data source
  - Structured, unstructured
What is Big Data? (cont’d.)
- Veracity
  - Credibility of the source
  - Suitability of data for the target audience
  - Evaluated through quality testing or credibility analysis
25.2 Introduction to MapReduce and Hadoop
- Core components of Hadoop
  - MapReduce programming paradigm
  - Hadoop Distributed File System (HDFS)
- Hadoop originated from quest for open source search engine
  - Developed by Cutting and Cafarella in 2004
  - Cutting joined Yahoo in 2006
  - Yahoo spun off Hadoop-centered company in 2011
  - Tremendous growth
Introduction to MapReduce and Hadoop (cont’d.)
- MapReduce
  - Fault-tolerant implementation and runtime environment
  - Developed by Dean and Ghemawat at Google in 2004
  - Programming style: map and reduce tasks
    - Automatically parallelized and executed on large clusters of commodity hardware
  - Allows programmers to analyze very large datasets
  - Underlying data model assumed: key-value pair
The MapReduce Programming Model
- Map
  - Generic function that takes a key of type K1 and value of type V1
  - Returns a list of key-value pairs of type K2 and V2
- Reduce
  - Generic function that takes a key of type K2 and a list of values V2 and returns pairs of type (K3, V3)
- Outputs from the map function must match the input type of the reduce function
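
The type contract above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop's API: `run_mapreduce`, `length_mapper`, and `count_reducer` are hypothetical names, and an in-memory dictionary stands in for the distributed grouping step that a real cluster performs between the two phases.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """Hypothetical in-memory driver: map, group by K2, then reduce."""
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for key, value in records:
        # map: (K1, V1) -> list of (K2, V2)
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(groups):
        # reduce: (K2, list of V2) -> pairs of (K3, V3)
        output.extend(reducer(k2, groups[k2]))
    return output


def length_mapper(line_no: int, line: str) -> Iterable[Tuple[int, str]]:
    """Illustrative mapper: emit (word length, word) for each word."""
    for word in line.split():
        yield (len(word), word)


def count_reducer(length: int, words: List[str]) -> Iterable[Tuple[int, int]]:
    """Illustrative reducer: emit (word length, number of words)."""
    yield (length, len(words))


records = [(1, "big data"), (2, "map and reduce")]
result = run_mapreduce(records, length_mapper, count_reducer)
```

Here K1 is a line number, V1 a line of text, K2 a word length, and V2 a word; the reducer's input types match the mapper's output types, as the last bullet requires.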
The MapReduce Programming Model (cont’d.)
Figure 25.1 Overview of MapReduce execution (Adapted from T. White, 2012)
The MapReduce Programming Model (cont’d.)
- MapReduce example
  - Make a list of frequencies of words in a document
  - Pseudocode
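
A minimal sketch of the word-frequency job in Python, assuming a single process in place of a Hadoop cluster. The function names (`map_words`, `reduce_counts`, `word_count`) are hypothetical, and lowercasing plus whitespace splitting is an assumed tokenization.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map: for every word in the document, emit (word, 1)."""
    for word in text.lower().split():
        yield (word, 1)


def reduce_counts(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    yield (word, sum(counts))


def word_count(documents):
    """Hypothetical driver: group map output by word, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_words(doc_id, text):
            grouped[word].append(one)
    result = {}
    for word, counts in grouped.items():
        for key, total in reduce_counts(word, counts):
            result[key] = total
    return result


frequencies = word_count({"d1": "to be or not to be"})
```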
The MapReduce Programming Model (cont’d.)
- MapReduce example (cont’d.)
  - Actual MapReduce code
The MapReduce Programming Model (cont’d.)
- Distributed grep
  - Looks for a given pattern in a file
  - Map function emits a line if it matches a supplied pattern
  - Reduce function is an identity function
- Reverse Web-link graph
  - Outputs (target URL, source URL) pairs for each link to a target page found in a source page
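
Both examples can be sketched in a few lines of Python. This is an illustrative single-process version, not Hadoop code: the function names are hypothetical, lines are keyed by line number as an assumption, and the final grouping of sources per target is one plausible reduce step.

```python
import re
from collections import defaultdict


def grep_map(line_no, line, pattern):
    """Map: emit the line only if it matches the supplied pattern."""
    if re.search(pattern, line):
        yield (line_no, line)


def identity_reduce(key, values):
    """Reduce: pass every (key, value) through unchanged."""
    for value in values:
        yield (key, value)


def distributed_grep(lines, pattern):
    grouped = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for key, value in grep_map(line_no, line, pattern):
            grouped[key].append(value)
    return [pair for key in sorted(grouped) for pair in identity_reduce(key, grouped[key])]


def link_map(source_url, targets):
    """Map: for each link found in a source page, emit (target URL, source URL)."""
    for target in targets:
        yield (target, source_url)


def reverse_links(pages):
    """Group the emitted pairs so each target lists the pages that link to it."""
    grouped = defaultdict(list)
    for source, targets in pages.items():
        for target, src in link_map(source, targets):
            grouped[target].append(src)
    return {target: sorted(sources) for target, sources in grouped.items()}


matches = distributed_grep(["error: disk full", "ok", "error: timeout"], r"^error")
graph = reverse_links({"a.html": ["b.html", "c.html"], "b.html": ["c.html"]})
```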
The MapReduce Programming Model (cont’d.)
- Inverted index
  - Builds an inverted index based on all words present in a document repository
  - Map function parses each document
    - Emits a sequence of (word, document_id) pairs
  - Reduce function takes all pairs for a given word and sorts them by document_id
- Job
  - Code for Map and Reduce phases, a set of artifacts, and properties
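
The inverted-index example can be sketched as follows in Python, again as a single-process illustration. The names `index_map`, `index_reduce`, and `build_inverted_index` are hypothetical, and removing duplicate document ids within a word's posting list is an assumption not stated in the bullets.

```python
from collections import defaultdict


def index_map(document_id, text):
    """Map: parse the document and emit (word, document_id) pairs."""
    for word in text.lower().split():
        yield (word, document_id)


def index_reduce(word, document_ids):
    """Reduce: sort the ids for one word (duplicates dropped, by assumption)."""
    return (word, sorted(set(document_ids)))


def build_inverted_index(repository):
    """Hypothetical driver: group map output by word, then reduce each group."""
    grouped = defaultdict(list)
    for document_id, text in repository.items():
        for word, doc in index_map(document_id, text):
            grouped[word].append(doc)
    return dict(index_reduce(word, ids) for word, ids in grouped.items())


index = build_inverted_index({2: "hadoop stores data", 1: "hadoop runs mapreduce"})
```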
The MapReduce Programming Model (cont’d.)
- Hadoop releases
  - 1.x features
    - Continuation of the original code base
    - Additions include security, additional HDFS and MapReduce improvements
  - 2.x features
    - YARN (Yet Another Resource Negotiator)
    - A new MR runtime that runs on top of YARN
    - Improved HDFS that supports federation and increased availability
25.3 Hadoop Distributed File System (HDFS)
- HDFS
  - File system component of Hadoop
  - Designed to run on a cluster of commodity hardware
  - Patterned after UNIX file system
  - Provides high-throughput access to large datasets
  - Stores metadata on NameNode server
  - Stores application data on DataNode servers
    - File content replicated on multiple DataNodes
Hadoop Distributed File System (cont’d.)
- HDFS design assumptions and goals
  - Hardware failure is the norm
  - Batch processing
  - Large datasets
  - Simple coherency model
- HDFS architecture
  - Master-slave
  - Decouples metadata from data operations
  - Replication provides reliability and high availability
  - Network traffic minimized
Hadoop Distributed File System (cont’d.)
- NameNode
  - Maintains image of the file system
    - i-nodes and corresponding block locations
  - Changes maintained in write-ahead commit log called Journal
- Secondary NameNodes
  - Checkpointing role or backup role
- DataNodes
  - Store blocks in the node’s native file system
  - Periodically report state to the NameNode
Hadoop Distributed File System (cont’d.)
- File I/O operations
  - Single-writer, multiple-reader model
  - Files cannot be updated, only appended
  - Write pipeline set up to minimize network utilization
- Block placement
  - Nodes of Hadoop cluster typically spread across many racks
  - Nodes on a rack share a switch
Hadoop Distributed File System (cont’d.)
- Replica management
  - NameNode tracks number of replicas and block location
    - Based on block reports
  - Replication priority queue contains blocks that need to be replicated
- HDFS scalability
  - Yahoo cluster achieved 14 petabytes, 4000 nodes, 15k clients, and 600 million files
The Hadoop Ecosystem
- Related projects with additional functionality
  - Pig and Hive
    - Provide higher-level interface for working with Hadoop framework
  - Oozie
    - Service for scheduling and running workflows of jobs
  - Sqoop
    - Library and runtime environment for efficiently moving data between relational databases and HDFS
The Hadoop Ecosystem (cont’d.)
- Related projects with additional functionality (cont’d.)
  - HBase
    - Column-oriented key-value store that uses HDFS
25.4 MapReduce: Additional Details
- MapReduce runtime environment
  - JobTracker
    - Master process
    - Responsible for managing the life cycle of Jobs and scheduling Tasks on the cluster
  - TaskTracker
    - Slave process
    - Runs on all Worker nodes of the cluster
MapReduce: Additional Details (cont’d.)
- Overall flow of a MapReduce job
  - Job submission
  - Job initialization
  - Task assignment
  - Task execution
  - Job completion
MapReduce: Additional Details (cont’d.)
- Fault tolerance in MapReduce
  - Task failure
    - Runtime exception
    - Java virtual machine crash
    - No timely updates from the task process
  - TaskTracker failure
    - Crash or disconnection from JobTracker
    - Failed Tasks are rescheduled
  - JobTracker failure
    - Not a recoverable failure in Hadoop v1
MapReduce: Additional Details (cont’d.)
- The shuffle procedure
  - Reducers get all the rows for a given key together
  - Map phase
    - Background thread partitions buffered rows based on the number of Reducers in the job and the Partitioner
    - Rows sorted on key values
    - Comparator or Combiner may be used
  - Copy phase
  - Reduce phase
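
The three shuffle phases can be illustrated with a small Python sketch. It is a simplification under stated assumptions: `partition`, `map_side_shuffle`, and `reduce_side_merge` are hypothetical names, `zlib.crc32` is used only as a deterministic stand-in for a Partitioner's hash function, and in-memory lists replace the spill files and network copies of a real cluster.

```python
import zlib
from collections import defaultdict
from itertools import groupby


def partition(key, num_reducers):
    """Stand-in Partitioner: deterministic hash of the key modulo the Reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side_shuffle(rows, num_reducers, combiner=None):
    """Map phase: partition one mapper's buffered rows, sort each partition on key,
    and optionally apply a Combiner to the values of each key."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[partition(key, num_reducers)].append((key, value))
    spilled = {}
    for part, part_rows in partitions.items():
        part_rows.sort(key=lambda kv: kv[0])
        if combiner is not None:
            part_rows = [
                (key, combiner([value for _, value in group]))
                for key, group in groupby(part_rows, key=lambda kv: kv[0])
            ]
        spilled[part] = part_rows
    return spilled


def reduce_side_merge(spills, part):
    """Copy and reduce phases: gather one partition from every mapper's output,
    merge on key, and hand each key's values to the Reducer together."""
    rows = sorted(
        (kv for spill in spills for kv in spill.get(part, [])),
        key=lambda kv: kv[0],
    )
    return {
        key: [value for _, value in group]
        for key, group in groupby(rows, key=lambda kv: kv[0])
    }


mapper1 = map_side_shuffle([("a", 1), ("b", 1), ("a", 1)], num_reducers=2, combiner=sum)
mapper2 = map_side_shuffle([("a", 1), ("c", 1)], num_reducers=2, combiner=sum)
merged = {}
for part in range(2):
    merged.update(reduce_side_merge([mapper1, mapper2], part))
```

Because every occurrence of a key hashes to the same partition, each key reaches exactly one Reducer, and the Combiner has already collapsed the two `"a"` rows from the first mapper into a single partial sum.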
MapReduce: Additional Details (cont’d.)
- Job scheduling
  - JobTracker schedules work on cluster nodes
  - Fair Scheduler
    - Provides fast response time to small jobs in a Hadoop shared cluster
  - Capacity Scheduler
    - Geared to meet needs of large enterprise customers
MapReduce: Additional Details (cont’d.)
- Strategies for equi-joins in MapReduce environment
  - Sort-merge join
  - Map-side hash join
  - Partition join
  - Bucket joins
  - N-way map-side joins
  - Simple N-way joins
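
Two of the listed strategies can be contrasted in a short Python sketch. This is an illustration under assumptions rather than Hadoop code: the function names and sample relations are hypothetical, both inputs are lists of (join key, row) pairs, and in `sort_merge_join` a dictionary plus sorted key iteration stands in for the sort performed by the shuffle.

```python
from collections import defaultdict


def sort_merge_join(left, right):
    """Reduce-side join: rows from both inputs are grouped on the join key,
    then every left row is paired with every right row in the same group."""
    grouped = defaultdict(lambda: ([], []))
    for key, row in left:
        grouped[key][0].append(row)
    for key, row in right:
        grouped[key][1].append(row)
    return [
        (key, left_row, right_row)
        for key in sorted(grouped)
        for left_row in grouped[key][0]
        for right_row in grouped[key][1]
    ]


def map_side_hash_join(small, large):
    """Map-side join: build a hash table on the small input, then probe it
    with each row of the large input; no shuffle or reduce step is needed."""
    table = defaultdict(list)
    for key, row in small:
        table[key].append(row)
    return [
        (key, small_row, row)
        for key, row in large
        for small_row in table.get(key, [])
    ]


departments = [(10, "Research"), (20, "Admin")]
employees = [(10, "Smith"), (20, "Wong"), (10, "Zelaya"), (30, "Borg")]
reduce_side = sort_merge_join(departments, employees)
map_side = map_side_hash_join(departments, employees)
```

The map-side variant only makes sense when the smaller input fits in each mapper's memory; the reduce-side variant has no such limit but pays for a full shuffle.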
MapReduce: Additional Details (cont’d.)
- Apache Pig
  - Bridges the gap between declarative-style interfaces such as SQL, and rigid style required by MapReduce
  - Designed to solve problems such as ad hoc analyses of Web logs and clickstreams
  - Accommodates user-defined functions
MapReduce: Additional Details (cont’d.)
- Apache Hive
  - Provides a higher-level interface to Hadoop using SQL-like queries
  - Supports processing of aggregate analytical queries typical of data warehouses
  - Developed at Facebook
Hive System Architecture and Components
Figure 25.2 Hive system architecture and components
Advantages of the Hadoop/MapReduce Technology
- Disk seek rate a limiting factor when dealing with very large data sets
  - Limited by disk mechanical structure
- Transfer speed is an electronic feature and increasing steadily
- MapReduce processes large datasets in parallel
- MapReduce handles semistructured data and key-value datasets more easily
- Linear scalability
25.5 Hadoop v2 (Alias YARN)
- Reasons for developing Hadoop v2
  - JobTracker became a bottleneck
  - Cluster utilization less than desirable
  - Different types of applications did not fit into the MR model
  - Difficult to keep up with new open source versions of Hadoop
YARN Architecture
- Separates cluster resource management from Jobs management
- ResourceManager and NodeManager together form a platform for hosting any application on YARN
- ApplicationMasters send ResourceRequests to the ResourceManager, which then responds with cluster Container leases
- NodeManagers are responsible for managing Containers on their nodes
Hadoop Version Schematics
Figure 25.3 The Hadoop v1 vs. Hadoop v2 schematic
Other Frameworks on YARN
- Apache Tez
  - Extensible framework being developed at Hortonworks for building high-performance applications in YARN
- Apache Giraph
  - Open-source implementation of Google’s Pregel system, a large-scale graph processing system used to calculate Page-Rank
- Hoya: HBase on YARN
  - More flexibility and improved cluster utilization
25.6 General Discussion
- Hadoop/MapReduce versus parallel RDBMS
  - 2009: performance of two approaches measured
    - Parallel database took longer to tune compared to MR
    - Performance of parallel database 3-6 times faster than MR
  - MR improvements since 2009
  - Hadoop has upfront cost advantage
    - Open source platform
General Discussion (cont’d.)
- MR able to handle semistructured datasets
  - Support for unstructured data on the rise in RDBMSs
- Higher level language support
  - SQL for RDBMSs
  - Hive has incorporated SQL features in HiveQL
- Fault-tolerance: advantage of MR-based systems
General Discussion (cont’d.)
- Big data somewhat dependent on cloud technology
- Cloud model offers flexibility
  - Scaling out and scaling up
  - Distributed software and interchangeable resources
  - Unpredictable computing needs not uncommon in big data projects
  - High availability and durability
General Discussion (cont’d.)
- Data locality issues
  - Network load a concern
  - Self-configurable, locality-based data and virtual machine management framework proposed
    - Enables access of data locally
  - Caching techniques also improve performance
- Resource optimization
  - Challenge: optimize globally across all jobs in the cloud rather than per-job resource optimizations
General Discussion (cont’d.)
- YARN as a data service platform
  - Emerging trend: Hadoop as a data lake
    - Contains significant portion of enterprise data
    - Processing happens
  - Support for SQL in Hadoop is improving
  - Apache Storm
    - Distributed scalable streaming engine
    - Allows users to process real-time data feeds
  - Storm on YARN and SAS on YARN
General Discussion (cont’d.)
- Challenges faced by big data technologies
  - Heterogeneity of information
  - Privacy and confidentiality
  - Need for visualization and better human interfaces
  - Inconsistent and incomplete information
General Discussion (cont’d.)
- Building data solutions on Hadoop
  - May involve assembling ETL (extract, transform, load) processing, machine learning, graph processing, and/or report creation
  - Programming models and metadata not unified
  - Analytics application developers must try to integrate services into coherent solution
- Cluster a vast resource of main memory and flash storage
  - In-memory data engines
  - Spark platform from Databricks
25.7 Summary
- Big data technologies at the center of data analytics and machine learning applications
- MapReduce
- Hadoop Distributed File System
- Hadoop v2 or YARN
  - Generic data services platform
- MapReduce/Hadoop versus parallel DBMSs