Internet of things Crash Course Workshop
Transcript of Internet of things Crash Course Workshop
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Real-Time Processing in Hadoop
Hadoop Summit 2015
Ali Bajwa
Partner Solutions Engineer
June 2015
Agenda
• Introduction & about Hortonworks HDP
• Overview of logistics industry scenario
• Overview of streaming architecture on HDP
• Streaming Demo #1
• Integrating Predictive Analytics in streaming scenarios
• Streaming Demo with Predictive additions
• Q & A
Preface: Enabling Technologies
• Problems solved at scale, via fundamentally new approaches…
• Make it possible, even simple, to produce new products/applications that would have been too cost-prohibitive, or simply impossible, beforehand.
• Where foundation tech like Li-Ion batteries, retina displays, GPS & tiny HD cameras (from smartphones) have enabled Electric cars, quad-copters, VR displays, & more…
• Hadoop has similarly led to breakthroughs in big data scale & capability, and enables new real-time advanced analytic applications.
Why did Hadoop emerge?
April 2015
Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to scale
[Chart: Business Value of traditional versus new data, contrasting LAGGARDS with INDUSTRY LEADERS. 1: Traditional data (ERP, CRM, SCM). 2: New Data (Clickstream, Geolocation, Web Data, Internet of Things, Docs, emails, Server logs). Data growth: 2.8 Zettabytes in 2012, 40 Zettabytes in 2020.]
Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Spring 2015
Hortonworks. We do Hadoop.
Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Customer Momentum
• 330+ customers (as of year-end 2014)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focus on enterprise needs
• Unrivaled world class support
• Founded in 2011
• Original 24 architects, developers, operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners
Customer Partnerships matter
Driving our innovation through Apache Software Foundation Projects
Apache Project    Committers    PMC Members
Hadoop            27            21
Pig               5             5
Hive              18            6
Tez               16            15
HBase             6             4
Phoenix           4             4
Accumulo          2             2
Storm             3             2
Slider            11            11
Falcon            5             3
Flume             1             1
Sqoop             1             1
Ambari            34            27
Oozie             3             2
Zookeeper         2             1
Knox              13            3
Ranger            10            n/a
TOTAL             161           108
Source: Apache Software Foundation. As of 11/7/2014.
Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF
• Expertise: Uniquely capable of solving the most complex issues & ensuring success with the latest features
• Connection: Provide customers & partners direct input into the community roadmap
• Partnership: We partner with customers through a subscription offering. Our success is predicated on yours.
27
Cloudera: 11
Facebook: 5
LinkedIn: 2
IBM: 2
Others: 23
Yahoo: 10
Technology Partnerships matter
[Table: partner relationship matrix. Columns: Apache Project, Hortonworks Relationship, Named Partner, Certified Solution, Resells, Joint Engr. Rows: Microsoft, HP, SAS, SAP, IBM, Pivotal, Redhat, Teradata, Informatica, Oracle.]
It is not just about packaging and certifying software…
Our joint engineering with our partners drives open source standards for Apache Hadoop
HDP is Apache Hadoop
HDP delivers a Centralized Architecture
Modern Data Architecture
• Unifies data and processing.
• Enables applications to have access to all your enterprise data through an efficient centralized platform
• Supported with a centralized approach to governance, security and operations
• Versatile to handle any applications and datasets no matter the size or type
[Diagram: Modern Data Architecture. SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured, and Existing Systems (ERP, CRM, SCM). ANALYTICS: Data Marts, Applications, Business Analytics, Visualization & Dashboards. Platform: YARN: Data Operating System over HDFS (Hadoop Distributed File System), running Batch, Interactive, Real-Time and Partner ISV workloads, alongside MPP and EDW systems.]
HDP delivers a completely open data platform
Hortonworks Data Platform 2.2
Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data.
Completely Open
• HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations
• All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem.
[Diagram: Hortonworks Data Platform 2.2 components]
• GOVERNANCE: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka
• BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Apache Pig, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm
• YARN: Data Operating System (Cluster Resource Management)
• HDFS (Hadoop Distributed File System)
• SECURITY: Apache Ranger, Apache Knox, Apache Falcon
• OPERATIONS: Apache Ambari, Apache Zookeeper, Apache Oozie
Real World Use Case: Trucking Company
Spring 2015
Hortonworks. We do Hadoop.
Scenario Overview.
Trucking company w/ large fleet of trucks in Midwest
A truck generates millions of events for a given route; an event could be:
‘Normal’ events: starting / stopping of the vehicle
‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance
Company uses an application that monitors truck locations and violations from the truck/driver in real-time
Route? Truck? Driver?
Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers
Trucking Company’s YARN-enabled Architecture
[Diagram: one cluster with consistent security, governance & operations]
• Truck Sensors
• Inbound Messaging (Kafka)
• Stream Processing (Storm)
• Real-time Serving (HBase)
• Alerts & Events (ActiveMQ)
• Real-Time User Interface
• SQL Interactive Query (Hive on Tez)
• Many Workloads: YARN
• Distributed Storage: HDFS
What is Kafka?
• High throughput distributed messaging system
• Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes
Kafka @LinkedIn:
• 800 billion messages per day
• 175 terabytes of data written per day
• 650 terabytes of data read per day
• Over 13 million messages / 2.75 GB of data per second
[Diagram: three producers publish to a Kafka Cluster; three consumers read from it]
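Kafka: Anatomy of a Topic
[Diagram: a topic split into Partition 0, Partition 1 and Partition 2. Each partition is an ordered log of numbered offsets starting at 0; the three logs have different lengths, ending at offsets 9, 11 and 12. Offsets run from Old to New, and Writes are appended at the New end.]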
Partitioning allows topics to scale beyond a single machine/node
Topics can also be replicated, for high availability.
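The partitioning idea above can be sketched in a few lines of Python. This is a toy in-memory model, not the Kafka client API: the `Topic` class, its `send` and `read` methods, the `truck_events` topic name, and the CRC32-based key hashing are all assumptions made for illustration.

```python
import zlib


class Topic:
    """Toy model of a partitioned topic: each partition is an append-only list."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Hypothetical partitioner: a stable hash of the key modulo the partition count.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        # Append to the chosen partition; the offset is the record's position in that log.
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        # A consumer tracks its own offset and reads forward from it.
        return self.partitions[partition][offset:]


topic = Topic("truck_events", num_partitions=3)
for i in range(6):
    truck_id = f"truck-{i % 2}"
    partition, offset = topic.send(truck_id, f"event-{i}")
    print(truck_id, "-> partition", partition, "offset", offset)
```

Because the partition is derived from the key, all events for one truck land in the same partition and keep their relative order, while different trucks can be spread across machines.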
Apache Storm
• Distributed, real-time, fault-tolerant stream processing platform.
• Provides processing guarantees.
• Key concepts include:
– Tuples
– Streams
– Spouts
– Bolts
– Topology
Tuples and Streams
• What is a Tuple?
– Fundamental data structure in Storm: a named list of values, each of which can be of any data type.
• What is a Stream?
– An unbounded sequence of tuples.
– The core abstraction in Storm; streams are what you “process” in Storm.
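A rough Python analogy for these two concepts, not Storm code: a `namedtuple` plays the role of a tuple and a generator that never ends plays the role of a stream. The `TruckEvent` field names and the generated values are hypothetical.

```python
from collections import namedtuple
from itertools import count, islice

# A tuple is a named list of values; these field names are hypothetical.
TruckEvent = namedtuple("TruckEvent", ["truck_id", "driver_id", "event_type", "speed_mph"])


def truck_event_stream():
    """An unbounded stream: a generator that never runs out of tuples."""
    event_types = ["Normal", "Speeding", "Normal", "Unsafe tail distance"]
    for i in count():
        yield TruckEvent(
            truck_id=i % 3,
            driver_id=100 + i % 3,
            event_type=event_types[i % len(event_types)],
            speed_mph=55 + (i % 4) * 10,
        )


# Processing a stream means consuming tuples one at a time; islice bounds the demo.
first_five = list(islice(truck_event_stream(), 5))
for event in first_five:
    print(event.truck_id, event.event_type, event.speed_mph)
```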
Spouts
• What is a Spout?
– Generates streams, or acts as a source of streams
– E.g.: JMS, Twitter, Log, Kafka Spout
– Can spin up multiple instances of a Spout and dynamically adjust as needed
Bolts
• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
– Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
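The threshold check performed by the MonitoringBolt might look roughly like the Python sketch below. It is not the demo's code: the `monitor` function, the threshold value of 2, and the in-memory counter and alert list (standing in for HBase and for the email / ActiveMQ messages) are assumptions for illustration.

```python
from collections import Counter

VIOLATION_THRESHOLD = 2  # hypothetical value; the slides do not state the threshold


def monitor(events, threshold=VIOLATION_THRESHOLD):
    """Count violations per driver and emit an alert whenever a count exceeds the threshold."""
    counts = Counter()  # stands in for the incident counts persisted in HBase
    alerts = []         # stands in for the email / ActiveMQ messages
    for driver_id, event_type in events:
        if event_type == "Normal":
            continue
        counts[driver_id] += 1
        if counts[driver_id] > threshold:
            alerts.append(f"driver {driver_id}: {counts[driver_id]} violations")
    return alerts


events = [
    (11, "Speeding"),
    (12, "Normal"),
    (11, "Unsafe tail distance"),
    (11, "Speeding"),
    (12, "Speeding"),
]
alerts = monitor(events)
print(alerts)
```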
Topology
• What is a Topology?
– A network of spouts and bolts wired together into a workflow
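As a loose illustration of wiring spouts and bolts together, the sketch below chains Python generators. It is an analogy rather than the Storm API: `truck_spout`, `violation_filter_bolt`, `count_bolt` and `build_topology` are hypothetical names, and the topology here is a simple chain, whereas a Storm topology can be an arbitrary graph.

```python
def truck_spout():
    """Spout: the source of the stream (a fixed list here, Kafka in the use case)."""
    yield {"truck_id": 1, "event_type": "Normal"}
    yield {"truck_id": 2, "event_type": "Speeding"}
    yield {"truck_id": 1, "event_type": "Speeding"}


def violation_filter_bolt(stream):
    """Bolt: passes through only violation tuples."""
    for tup in stream:
        if tup["event_type"] != "Normal":
            yield tup


def count_bolt(stream):
    """Bolt: emits a running violation count per truck."""
    counts = {}
    for tup in stream:
        counts[tup["truck_id"]] = counts.get(tup["truck_id"], 0) + 1
        yield tup["truck_id"], counts[tup["truck_id"]]


def build_topology(spout, bolts):
    """Wire a spout to a chain of bolts, each consuming the previous stage's output."""
    stream = spout()
    for bolt in bolts:
        stream = bolt(stream)
    return stream


results = list(build_topology(truck_spout, [violation_filter_bolt, count_bolt]))
print(results)
```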
Key Constructs in Apache HBase
• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
– Updateable records
– Versioned records
– Distributed across a cluster of machines
– Low latency
– Caching
• Popular use cases:
– User profiles and session state
– Object store
– Sensor apps
Data Assignment
[Diagram: HBase Table. Keys within HBase are divided among different RegionServers.]
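One way to picture this assignment: the sorted row-key space is cut into contiguous ranges (regions), and each region is hosted by one RegionServer. The Python sketch below looks up the server for a key with a binary search; the start keys and server names are made up for illustration, and this is not how a client actually locates regions.

```python
from bisect import bisect_right

# Hypothetical layout: each region starts at a row key and is served by one RegionServer.
REGION_START_KEYS = ["", "g", "n", "t"]        # sorted; "" marks the start of the table
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]  # server hosting each region


def region_server_for(row_key):
    """Find the region whose start key is the greatest one not exceeding row_key."""
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


for key in ["driver-11", "route-7", "truck-42"]:
    print(key, "->", region_server_for(key))
```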
Data Access
• Get
– Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey
• Put
– Inserts a new version of a cell.
• Scan
– Reads the whole table, row by row, or a section of that table starting at a particular start key and ending at a particular end key
• Delete
– Actually a form of put: adds a new version that carries a deletion marker
• SQL via Apache Phoenix
– Unique capability in the NoSQL market
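The Get, Put, Scan and Delete semantics above can be mimicked with a small in-memory model. This is a sketch under simplifying assumptions, not the HBase client API: `ToyTable`, the counter used as a timestamp, and the `events:count` column name are hypothetical, and the scan treats the end key as exclusive.

```python
import itertools

TOMBSTONE = object()         # deletion marker
_clock = itertools.count(1)  # stand-in for cell timestamps


class ToyTable:
    def __init__(self):
        # (row_key, column) -> list of (version, value), newest last
        self.cells = {}

    def put(self, row, column, value):
        # Every put adds a new version; older versions are kept.
        self.cells.setdefault((row, column), []).append((next(_clock), value))

    def delete(self, row, column):
        # A delete is a put of a deletion marker.
        self.put(row, column, TOMBSTONE)

    def get(self, row, column):
        # Return the newest version, unless it is a deletion marker.
        versions = self.cells.get((row, column))
        if not versions or versions[-1][1] is TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_row, stop_row):
        """Yield (row, column, value) for start_row <= row < stop_row, in key order."""
        for row, column in sorted(self.cells):
            if start_row <= row < stop_row:
                value = self.get(row, column)
                if value is not None:
                    yield row, column, value


table = ToyTable()
table.put("driver-11", "events:count", 1)
table.put("driver-11", "events:count", 2)
table.put("driver-12", "events:count", 5)
table.put("driver-20", "events:count", 7)
table.delete("driver-12", "events:count")
print(table.get("driver-11", "events:count"))
print(list(table.scan("driver-11", "driver-20")))
```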
[Diagram: evolution of Hadoop, with timeline markers 2006, 2009 and October 23, 2013]
• Hadoop w/ MapReduce: MapReduce (largely batch processing) running directly on HDFS (Hadoop Distributed File System). Silo’d clusters, largely batch system, difficult to integrate.
• MR-279: YARN
• Hadoop 2 & YARN: YARN-based architecture, with YARN: Data Operating System over HDFS (Hadoop Distributed File System), running Batch, Interactive and Real-Time workloads.
Architected & led development of YARN to enable the Modern Data Architecture
Benefits of YARN as the Data Operating System
• The container-based model allows for running nearly any workload.
– Enables the centralized architecture.
– No longer is MapReduce the only data processing engine.
– Docker containers managed by YARN. Yes Please!
• Decouples resource scheduling from application lifecycle.
– Improved scalability and fault tolerance
• Dynamically allocated resources, resulting in HUGE utilization gains
– Versus static allocation of “slots” in Hadoop 1.0
Yahoo has over 30,000 nodes running YARN across over 365 PB of data. By their calculation, they run about 400,000 jobs per day, for about 10 million hours of compute time.
They also have estimated a 60% – 150% improvement on node usage per day since moving to YARN.
Apache HDFS – Hadoop Distributed File System
• Very large scale distributed file system
• 10K nodes, tens of millions of files and PBs of data
• Supports large files
• Designed to run on commodity hardware, assumes hardware failures
• Files are replicated to handle hardware failure
• Detects failures and recovers from them automatically
• Optimized for Large Scale Processing
• Data locations are exposed so that the computations can move to where data resides
• Data Coherency
• Write once and read many times access pattern
• Files are broken up in chunks called ‘blocks’
• Blocks are distributed over nodes
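To make the block and replica ideas concrete, here is a small Python sketch. The 128 MB block size, the replication factor of 3, the node names and the round-robin placement are example assumptions; actual HDFS placement is rack-aware and decided by the NameNode.

```python
def split_into_blocks(file_size, block_size):
    """Return the size of each block; only the last block may be smaller."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Hypothetical round-robin placement of each block's replicas on distinct nodes."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


MB = 1024 * 1024
blocks = split_into_blocks(file_size=300 * MB, block_size=128 * MB)
placement = place_replicas(
    len(blocks), nodes=["node1", "node2", "node3", "node4"], replication=3
)
print([size // MB for size in blocks])
print(placement)
```

With every block stored on three different nodes, losing any single node still leaves two copies of each block available.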
Streaming Demo - High Level Architecture
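[Diagram: Truck Streaming Data from trucks T(1), T(2) … T(N) arrives on the Truck Events Topic in Inbound Messaging (Kafka). Storm Stream Processing, running on YARN, contains a Kafka Spout, an HBase Bolt, an HDFS Bolt and a Monitoring Bolt. Other labeled components: HBase with the Dangerous Events Table and Truck Events, Distributed Storage: HDFS, ActiveMQ, and the Web App.]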
Demo – Streaming Dashboard.
Lab #1: bit.ly/1L3RLMo
Lab #2: bit.ly/1FW7ENl (<-lower case L)
Lab #3: bit.ly/1L3S0ah
Shell cheatsheet: bit.ly/1JN8EsO
Slides: bit.ly/1MtVoIL (<-capital I)
Twitter demo: github.com/abajwa-hw/hdp22-twitter-demo
Custom services: github.com/hortonworks-gallery
webinars: hortonworks.com/partners/learn
email: abajwa@
IoT demo: youtube.com/watch?v=FHMMcMYhmNI