Architecting a Next Generation Data Platform
-
Upload
hadooparchbook -
Category
Engineering
-
view
1.825 -
download
4
Transcript of Architecting a Next Generation Data Platform
Hadoop Application Architectures:
Architecting a Next Generation Data
PlatformStrata + Hadoop World, San Jose 2017tiny.cloudera.com/app-arch-sanjose
tiny.cloudera.com/app-arch-questionsMark Grover | @mark_groverTed Malaska | @ted_malaska
Jonathan Seidman | @jseidmanGwen Shapira | @gwenshap
Questions? tiny.cloudera.com/app-arch-questions
Logistics▪ Break at 3:00 – 3:30 PM▪Questions at the end of each section▪ Slides at tiny.cloudera.com/app-arch-sanjose▪Code at https://github.com/hadooparchitecturebook/Taxi360
Questions? tiny.cloudera.com/app-arch-questions
About the book▪@hadooparchbook▪ hadooparchitecturebook.com▪ github.com/hadooparchitecturebook▪ slideshare.com/hadooparchbook
Questions? tiny.cloudera.com/app-arch-questions
About the presenters
▪ Technical Group Architect at Blizzard Entertainment▪ Principal Solutions Architect
at Cloudera▪ Big Data Archicect at FINRA▪ Contributor to Apache HDFS,
HBase, Flume, Avro, Pig, Spark, YARN, Sqoop, Kudu, Kafka
Ted Malaska
Questions? tiny.cloudera.com/app-arch-questions
About the presenters
▪ Software Engineer on Spark at Cloudera▪ Committer on Apache Bigtop,
PMC member on Apache Sentry(incubating)▪ Contributor to Apache Spark,
Hadoop, Hive, Sqoop, Pig, Flume
Mark Grover
Questions? tiny.cloudera.com/app-arch-questions
About the presenters
▪ Software Engineer at Cloudera▪ Contributor to Apache Sqoop.▪ Previously Technical Lead on
the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data
Jonathan Seidman
Questions? tiny.cloudera.com/app-arch-questions
About the presenters
▪ Product Manager at Confluent▪ PMC for Apache Kafka.▪ Previously software engineer
at Cloudera▪@gwenshap on twitter
Gwen Shapira
Questions? tiny.cloudera.com/app-arch-questions
Entity (Taxi) 360 View
Geo-location/ Traffic Data
Customer Data
Maintenance Data
Other Data Sources
Streaming Vehicle Data
Questions? tiny.cloudera.com/app-arch-questions
What Makes Hadoop a Fit?
Data Sources Extract Transform Load
The early days…
Questions? tiny.cloudera.com/app-arch-questions
What Makes Hadoop a Fit?
SERVERS MARTS EDWS DOCUMENTS STORAGE SEARCH ARCHIVE
ERP,CRM,RDBMS,MACHINES FILES,IMAGES,VIDEOS,LOGS,CLICKSTREAMS EXTERNALDATASOURCES
Today…
Questions? tiny.cloudera.com/app-arch-questions
Enabling a Range of New Use Cases…
Fraud Detection Market Transactions
Internet of Things
Network Security
Questions? tiny.cloudera.com/app-arch-questions
Challenges - Architectural Considerations ▪ Reliable and scalable ingress of multiple data types and sources:
- High volume event data? Batch data?▪ Reliable and scalable storage to support multiple workloads and access patterns
- Historical data? Real-time search? Analytics?▪ Processing engines (for background processing):
- Stream processing? Batch processing?▪ Data Modeling
- Modeling data for real-time random access? Analytic access? Batch access?
Questions? tiny.cloudera.com/app-arch-questions
Requirements▪ Allow users (technical and non-technical) to analyze and visualize data…
Questions? tiny.cloudera.com/app-arch-questions
Requirements▪ Provide analysts with query capabilities via a standard interface…
Questions? tiny.cloudera.com/app-arch-questions
Requirements▪ Provide developers the ability to perform batch processing on historical data…
Questions? tiny.cloudera.com/app-arch-questions
Requirements▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.- Ability to perform transformations on streaming data in flight.- Ability to perform sophisticated processing of historical data.
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
Data Producer Pub-SubProcessing &
Ingestion Engine
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
High level architectureTransportSource Stream
ProcessingStorage Access
Data Producers
Processing & Ingestion Engine
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Pub Sub
Questions? tiny.cloudera.com/app-arch-questions
Key to Customer 360 SuccessYour project is only as good as the quality and variety of data sources
Geo-location/ Traffic Data
Customer DataMaintenance Data
Other Data Sources
Streaming Vehicle Data
Files
CSV? XML? JSON?
Twitter?Mainframe?
Database Salesforce?
MQTT
Questions? tiny.cloudera.com/app-arch-questions
Get the Data: Flume vs. Kafka▪ Flume – well integrated with Hadoop.
- Part of Hadoop ecosystem- Great choice when ingesting data into HDFS.- Can support simple transformations.
▪ Kafka – flexible, get-everything pipe▪ Producers in ~ 20 languages▪ REST API▪ Huge connector ecosystem
Questions? tiny.cloudera.com/app-arch-questions
REST ProxyTalking to Non-native Kafka Apps and Outside the Firewall
REST Proxy
Non-Java Applications
Native Kafka Java Applications
REST / HTTP
Simplifies administrative actions
Simplifies message creation and consumption
Provides a RESTful interface to a Kafka cluster
Questions? tiny.cloudera.com/app-arch-questions
Kafka ConnectStreaming Data Capture
JDBC
Logs
MQTT
RDBMS
Key/Value
HDFS
Kafka Connect API
Kafka
Connector
Connector
Connector
Connector
Connector
Connector
Sources Sinks
Fault tolerant
Manage hundreds of data sources and sinks
Preserves data schema
Part of Apache Kafka project
Includes simple transformations
Questions? tiny.cloudera.com/app-arch-questions
Ecosystem of ConnectorsDatabases Datastore/File Store
Analytics Applications / Other
Questions? tiny.cloudera.com/app-arch-questions
How Connect Works?
Log Connector
MQTTConnector
REST API
Logs MQTT
Log Task Log Task
MQTTTask
MQTTTask
Questions? tiny.cloudera.com/app-arch-questions
Schema Registry
Elastic
Cassandra
HDFS
Example Consumers
SerializerSource 1
SerializerSource 2
!
Kafka Topic!
Schema Registry
Define the expected fields for each Kafka topic
Automatically handle schema changes (e.g. new fields)Prevent backwards incompatible changes
Get different data sources to talk the same language
Questions? tiny.cloudera.com/app-arch-questions
High level architectureTransportSource Stream
ProcessingStorage Access
Processing & Ingestion Engine
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Pub Sub
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Buffer Stream
ProcessingStorage Access
Pub-SubProcessing &
Ingestion Engine
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Data Producers
Questions? tiny.cloudera.com/app-arch-questions
Buffering Data▪ What do we mean by “buffering” and why do we need it?
event,event,event,event,event,event…
This is bad!
▪ Network partitions happen▪ Producers and Consumers
work at different rates
▪ Reliable storage is hard
Stream processing is hard
Lets do one at a time
Questions? tiny.cloudera.com/app-arch-questions
Buffering Data – Message Brokers
Publisher
Publisher
Publisher
Message Queue
Subscriber
Subscriber
Subscriber
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Buffer Stream
ProcessingStorage Access
Processing & Ingestion Engine
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
What is Kafka?▪ It’s like a message queue, right?
- Actually, it’s a “distributed commit log”- Or “streaming data platform”
0 1 2 3 4 5 6 7 8
Data Source
Data Consumer
A
Data Consumer
B
Questions? tiny.cloudera.com/app-arch-questions
Topics and Partitions▪ Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.- Note that time ordering is guaranteed within, but not across, partitions.
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
Partition 0
Partition 1
Partition 2
Data SourceTopic
Questions? tiny.cloudera.com/app-arch-questions
Consumers
In Our Architecture
Taxi Trip Data
Producer
Kafka
taxi-trip-input
Topic
Stream Processing(Analytic)
Stream Processing(Lookup)
Stream Processing
(Search)
Stream Processing
(Long Term)
Questions? tiny.cloudera.com/app-arch-questions
Input Events
CMT,2009-01-05 08:31:55,2009-01-05 8:37:50,1,0.90000000000000002,-73.977936999999997,40.745919000000001,,,-73.983609000000001,40.755051000000002,Credit,5.2999999999999998,0,,0.79000000000000004,0,6.0899999999999999
vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_DateTime,Passenger_Count,Trip_Distance,Start_Lon,Start_Lat,Rate_Code,store_and_forward,End_Lon,End_Lat,Payment_Type,Fare_Amt,surcharge,mta_tax,Tip_Amt,Tolls_Amt,Total_Amt
Questions? tiny.cloudera.com/app-arch-questions
Kafka Considerations – Reliability▪ But remember there are tradeoffs…
Questions? tiny.cloudera.com/app-arch-questions
Kafka Reliability – ReplicationProducer
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Lead
er
Questions? tiny.cloudera.com/app-arch-questions
Kafka Reliability – ReplicationProducer
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Questions? tiny.cloudera.com/app-arch-questions
Kafka Considerations – Reliability▪ Different reliability levels for topics:
Taxi Trip Data
Kafka
taxi-trip-input
Twitter customer-sentiment
100% – dups are ok
(“At least once”)
<=100%(“At most
once”)
News Flash:Kafka’s Exactly Once
Producer is on the way
Questions? tiny.cloudera.com/app-arch-questions
Kafka Reliability – ReplicationProducer
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Lead
er
Questions? tiny.cloudera.com/app-arch-questions
Kafka Reliability– ReplicationProducer
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Lead
er
Lead
er
Questions? tiny.cloudera.com/app-arch-questions
Kafka Reliability – ReplicationProducer
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Bro
ker
Parti
tion
1
Parti
tion
2
Parti
tion
3
Lead
er
Questions? tiny.cloudera.com/app-arch-questions
Kafka Reliability – Replication▪ So how does this relate to our application?
kafka-topics --zookeeper ZKHOST:ZKPORT –partition 2 --replication-factor 3 \--create --topic taxi-trip-input
kafka-topics --zookeeper ZKHOST:ZKPORT –partition 2 --replication-factor 1 \--create –topic customer-sentiment
Questions? tiny.cloudera.com/app-arch-questions
Kafka Reliability – Producers
Taxi Trip Data
Kafka
taxi_trip_input
Partition 1
Partition 2
Partition 3
Topic B
Partition 1
Partition 2
Partition 3
Message failure?
Producer
Resend message
acks=all
Questions? tiny.cloudera.com/app-arch-questions
Kafka Reliability – Producers▪ What about duplicates?
Taxi Trip Data
Kafka
taxi_trip_input
Partition 1
Partition 2
Partition 3
Topic B
Partition 1
Partition 2
Partition 3
Producer
ID Message1000 2009-01-04 03:02:00,1,2.629,...1001 2009-01-04 03:38:00,3,4.549…1001 2009-01-04 03:38:00,3,4.549…
Questions? tiny.cloudera.com/app-arch-questions
Kafka Scaling – Partitions
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Questions? tiny.cloudera.com/app-arch-questions
Kafka Scaling – Partitions
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Partition 4
Partition 5
Consumer
Consumer
Higher throughput
Higher throughput
More resources (memory)
More resources
(file handles)
Producer
Questions? tiny.cloudera.com/app-arch-questions
How many partitions?▪ Adding partitions late in the game is painful▪ Basic formula:total desired throughput / throughput of slowest consumer or producer▪ Or ~25GB disk space▪ Not too many because:
- Each partition takes broker heap memory and file handles- Each partition slows down node shutdown / recovery- 1000 – 4000 partitions per broker max- Producers will produce smaller batches – lower throughput
Questions? tiny.cloudera.com/app-arch-questions
Kafka Scaling – Producers
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Partition 4
Partition 5
Consumer
Consumer
Producer
Questions? tiny.cloudera.com/app-arch-questions
Guarding Against Message Loss▪ Producer – What happens if the producer loses connection to Kafka and the buffer overflows?- You get an exception. You can choose to… block? Write to file?
▪ Source – What happens if events are lost before getting sent to producer?- Once again use some kind of buffer to provide sufficient retention of data.
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
Custom Producer
or
Processing & Ingestion Engine
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
Streaming agenda▪ What do we mean by streaming?▪ Streaming use-cases▪ Streaming semantics▪ Which streaming engine to choose?▪ Streaming in our use-case
Questions? tiny.cloudera.com/app-arch-questions
What do we mean by streaming?
Constant low milliseconds & under
Low milliseconds to seconds, delay in case of failures
10s of seconds or more, re-run in case
of failures
Real-time Near real-time Batch
Questions? tiny.cloudera.com/app-arch-questions
What do we mean by streaming?
Constant low milliseconds & under
Low milliseconds to seconds, delay in case of failures
10s of seconds or more, re-run in case
of failures
Real-time Near real-time Batch
Questions? tiny.cloudera.com/app-arch-questions
But, there’s no free lunch
Constant low milliseconds & under
Low milliseconds to seconds, delay in case of failures
10s of seconds or more, re-run in case
of failures
Real-time Near real-time Batch
“Difficult” architectures, lower latency
“Easier” architectures, higher latency
Questions? tiny.cloudera.com/app-arch-questions
Streaming Use-cases▪ Ingestion (most relevant in our use-case)▪ Simple transformations
- Decision (e.g. anomaly detection)- Enrichment (e.g. add a state based on zipcode)▪ Advanced usage
- Machine Learning- Windowing
Questions? tiny.cloudera.com/app-arch-questions
#1 - Simple ingestion
BufferEvent e Stream
Processing Long term storage
Event e
Questions? tiny.cloudera.com/app-arch-questions
#2 - Enrichment
BufferEvent e Stream
Processing Storage
Event e’
e’ = enriched event e
Context store
Questions? tiny.cloudera.com/app-arch-questions
#2 - Decision
BufferEvent e Stream
Processing Storage
Event e’
e’ = e + decision
Rules
Questions? tiny.cloudera.com/app-arch-questions
#3 – Advanced usage
BufferEvent e Stream
Processing Storage
Event e’
e’ = aggregation or windowed aggregation
Model
Questions? tiny.cloudera.com/app-arch-questions
#1 – Simple Ingestion1. Zero transformation
- No transformation, plain ingest- Keep the original format – SequenceFile, Text, etc.- Allows to store data that may have errors in the schema
2. Format transformation- Simply change the format of the field- To a structured format, say, Avro, for example- Can do schema validation
3. Atomic transformation- Mask a credit card number
Questions? tiny.cloudera.com/app-arch-questions
#2 - Enrichment
BufferEvent e Stream
Processing Storage
Event e’
e’ = enriched event e
Context storeNeed to store the
context somewhere
Questions? tiny.cloudera.com/app-arch-questions
Where to store the context?1. Locally Broadcast Cached Dim Data
- Local to Process (On Heap, Off Heap)- Local to Node (Off Process)
2. Partitioned Cache- Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)
Questions? tiny.cloudera.com/app-arch-questions
#1a - Locally broadcast cached data
Could be On heap or Off heap
Questions? tiny.cloudera.com/app-arch-questions
#1b - Off process cached dataData is cached on the
node, outside of process. Potentially in an external system like
Rocks DB
Questions? tiny.cloudera.com/app-arch-questions
#2 - Partitioned cache data
Data is partitioned based on field(s) and
then cached
Questions? tiny.cloudera.com/app-arch-questions
#3 - External fetch
Data fetched from external system
Questions? tiny.cloudera.com/app-arch-questions
Delivery Types▪ At most once
- Not good for many cases- Only where performance/SLA is more important than accuracy
▪ Exactly once- Expensive to achieve but desirable
▪ At least once- Easiest to achieve
Questions? tiny.cloudera.com/app-arch-questions
Semantics of our architectureSource System 1
DestinationsystemSource System 2
Source System 3
Ingest Extract Streaming engine
Push
Message broker
Questions? tiny.cloudera.com/app-arch-questions
Classification of storage systems▪ File based
- S3- HDFS▪ NoSQL
- HBase- Cassandra▪ Document based
- Search▪ NoSQL-SQL
- Kudu
Questions? tiny.cloudera.com/app-arch-questions
Classification of storage systems▪ File based
- S3- HDFS▪ NoSQL
- HBase- Cassandra▪ Document based
- Search▪ NoSQL-SQL
- Kudu
De-duplication at file level
Semantics at key/record level
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
Processing & Ingestion Engine
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Apache Beam
Kafka Streams
Questions? tiny.cloudera.com/app-arch-questions
Requirements▪ Fault-tolerant and distributed▪ Effectively once semantics▪Handle processing time vs. event time▪ Allow stateful transformations
Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming▪Micro batch based architecture▪ Allows stateful transformations ▪ Feature rich
- Windowing- Sessionization- ML- SQL (Structured Streaming)
Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming
DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count Print
Source Receiver RDD
RDD
RDD
Single Pass
Filter Count Print
First Batc
h
Second Batch
Questions? tiny.cloudera.com/app-arch-questions
DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count
Source Receiver RDDpartitions
RDDParition
RDD
Single PassFilter Count
Pre-first Batch
First Batc
h
Second Batch
Stateful RDD 1
Stateful RDD 2
Stateful RDD 1
Spark Streaming
Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming - Gaps▪ Not as low of a latency- Efforts towards reducing latency e.g. RISElab’s Drizzle▪Global consistent execution state- Stop overall execution of distributed computation- Eagerly persist records in transit meaning larger snapshots
Questions? tiny.cloudera.com/app-arch-questions
Flink▪ True “streaming” system, but not as feature rich as Spark▪Much better event time handling▪Good built-in backpressure support▪ Allows stateful transformations▪ Lower Latency
- No Micro Batching- Asynchronous Barrier Snapshotting (ABS)
Questions? tiny.cloudera.com/app-arch-questions
Operator
Buffer
Operator
Buffer
Flink - ABS
Barrier 1A Hit
Barrier 1B Still Behind
Questions? tiny.cloudera.com/app-arch-questions
Operator
Buffer
Flink - ABS
Both Barriers Hit
Operator
Buffer
Barrier 1A Hit
Barrier 1B Still Behind
Questions? tiny.cloudera.com/app-arch-questions
Operator
Buffer
Flink - ABSBoth Barriers
Hit
Operator
Buffer Barrier is combined and can move on
Buffer can be flushed out
Questions? tiny.cloudera.com/app-arch-questions
Storm▪Old school▪Didn’t manage state – had to use Trident▪No good support for batch processing
Questions? tiny.cloudera.com/app-arch-questions
Samza▪Good integration with Kafka▪Doesn’t support batch▪ Forked by Kafka Streams
Questions? tiny.cloudera.com/app-arch-questions
Flume▪Well integrated with the Hadoop ecosystem▪ Allowed interceptors (for simple transformations)▪ Supports buffering- Memory- File- Kafka▪ But no real fault-tolerance▪No state management
Questions? tiny.cloudera.com/app-arch-questions
Kafka Streams▪Good integration with Kafka▪ Light-weight library (not a framework)▪No micro-batching, uses Kafka as internal messaging layer▪Maintains local state per node (in RocksDB, or in memory
hash map)▪Handles late events▪ Stream-to-stream joins
Questions? tiny.cloudera.com/app-arch-questions
Topic
Partition 1
Partition 2
Task 1 Re-partition topic
Partition 1
Partition 2
Task 3
Task 2 Task 4
Kafka Streams architecture
Questions? tiny.cloudera.com/app-arch-questions
Apache Beam▪ Abstraction on top of Streaming Engines▪ Best support for Google Dataflow
Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming▪We chose Spark Streaming because:- Same execution engine for batch and streaming- Similar code for batch and streaming- Support for security, Kafka integration- Thriving community
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
Structured Landing ZonesHive Relational Model
Kudu/HDFS
Hive Nested ModelHDFS
AggregationsKudu
HBase Entity Time Series
Solr
Traditional SQL
Optimized for nested Structures like JSON
Optimized Storing and mutating aggregates
Optimized Entity 360 and time base access
Optimized faceted charts and reverse index look ups
Questions? tiny.cloudera.com/app-arch-questions
Relational▪ Everyone knows it▪ Simple▪ Very painful to do large Join▪ May lead to customers making bad queries▪ Easier to mutate
Questions? tiny.cloudera.com/app-arch-questions
Kudu Data Models▪ Entity Summary Tables
- Quick update and access of aggregate of Entity Stats▪ Event Tables
- Number of Partitioning strategies- Partition by Entity- Partition by Hash on time
Questions? tiny.cloudera.com/app-arch-questions
Kudu: Table Creation ExampleCREATE EXTERNAL TABLE ny_taxi_trip (
vender_id STRING,tpep_pickup_datetime TIMESTAMP,tpep_dropoff_datetime TIMESTAMP,passenger_count INT,trip_distance DOUBLE,pickup_longitude DOUBLE,pickup_latitude DOUBLE,rate_code_id STRING,store_and_fwd_flag STRING,dropoff_longitude DOUBLE,dropoff_latitude DOUBLE,payment_type STRING,fare_amount DOUBLE,extra DOUBLE,mta_tax DOUBLE,improvement_surcharge DOUBLE,tip_amount DOUBLE,tolls_amount DOUBLE,total_amount DOUBLE
)STORED AS PARQUETLOCATION 'usr/root/hive/ny_taxi_trip';
Questions? tiny.cloudera.com/app-arch-questions
Kudu: Data PopulationSparkStreamingTaxiTripToKudu.scala
Questions? tiny.cloudera.com/app-arch-questions
View Strategies
Hive Relational Model
Hive Nested Model
Mod
els
Hive Normal Views
Hive Materialized Table Views
Use in the cases where the view requires a join that is done through a shuffle
Use only for tables that filter records/columns or use for marking fields
Questions? tiny.cloudera.com/app-arch-questions
Nested▪ Less Space than Denormalization▪ Still have tables but the cost of joins is all but gone▪ Also great for cartesian joins
- N x M vs N + M▪ Not really supported yet with Kudu or HBase with SQL
Questions? tiny.cloudera.com/app-arch-questions
Nested Writing Example in Spark{"id": "0001","type": "donut","name": "Cake","ppu": 0.55,"batters":{"batter":[{ "id": "1001", "type": "Regular" },{ "id": "1002", "type": "Chocolate" },{ "id": "1003", "type": "Blueberry" },{ "id": "1004", "type": "Devil's Food" }]
},"topping":[{ "id": "5001", "type": "None" },{ "id": "5002", "type": "Glazed" },{ "id": "5005", "type": "Sugar" },{ "id": "5007", "type": "Powdered Sugar" },{ "id": "5006", "type": "Chocolate with Sprinkles" }]
Questions? tiny.cloudera.com/app-arch-questions
Nested Writing Example in Sparkval jsonDF = hiveContext.read.json(jsonRDD)
jsonDF.write.parquet("./parquet")
hiveContext.createExternalTable("jsonNestedTable", "./parquet")
Questions? tiny.cloudera.com/app-arch-questions
Entity Centric Time Series▪ Partition by Entity ID▪ Order by Time▪ Allows for free windowing▪ Allows for fetching of single time window of single entity at web scale
Questions? tiny.cloudera.com/app-arch-questions
HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Questions? tiny.cloudera.com/app-arch-questions
HBase Entity Time SeriesCust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Rest Call Short Scan
Questions? tiny.cloudera.com/app-arch-questions
HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Mapper Mapper Mapper
Questions? tiny.cloudera.com/app-arch-questions
HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Mapper
Mapper Mapper
Questions? tiny.cloudera.com/app-arch-questions
HBase: Row Key Example def generateRowKey(customerTrans: CustomerTran, numOfSalts:Int): Array[Byte] = {val salt = StringUtils.leftPad(
Math.abs(customerTrans.customerId.hashCode % numOfSalts).toString, 4, "0")
Bytes.toBytes(salt + ":" +customerTrans.customerId + ":" +StringUtils.leftPad(customerTrans.eventTimeStamp.toString, 11, "0") + ":trans:" +customerTrans.transId)
}
Questions? tiny.cloudera.com/app-arch-questions
HBase: Population ExampleSparkStreamingTaxiTripToHBase.scala
Questions? tiny.cloudera.com/app-arch-questions
Cassandra Time SeriesCREATE TABLE temperature (
weatherstation_id text,event_time timestamp,temperature text,PRIMARY KEY (weatherstation_id,event_time)
);
Questions? tiny.cloudera.com/app-arch-questions
Cassandra Time SeriesINSERT INTO
temperature(weatherstation_id,event_time,temperature)
SELECT event_time,temperatureFROM temperatureWHERE weatherstation_id=’1234ABCD’;
Questions? tiny.cloudera.com/app-arch-questions
Solr: Data Model▪ Think of it like a cube on a object type
- In our case a taxi trip- Allows for rollups and aggregations from object’s point of view- Think of objects as immutable- Try to find time based events- May design more than one object type
Questions? tiny.cloudera.com/app-arch-questions
Solr Details
1 Trip:101 1
2 Trip:102 1
3 Trip:103 1
ID Document Live
4 Trip:104 1
5 Trip:105 1
ID Field Value Documents
1 Cash 1,3
2 Credit 2
3 Debit 4,5
Questions? tiny.cloudera.com/app-arch-questions
Single Value Aggregations▪ Get Array Lengths
ID Field Value Documents
1 Cash 1,3
2 Credit 2
3 Debit 4,5
Questions? tiny.cloudera.com/app-arch-questions
Multi Value Aggregations▪ Ordered Merge Join
- Think like a zipper- Scans- No Lookups▪ Top N from both sides- Leaving the rest to other▪ Indexes distributed▪ No need to read document data
1 4 5 7 8 9 10 14 16
2 3 6 11 12 13 15 17 18
1 2 3 6 7 8 10 15 18
Cash
Credit
Vender A
4 5 9 11 12 13 14 16 17Vender B
Questions? tiny.cloudera.com/app-arch-questions
Solr: Population ExampleSparkStreamingTaxiTripToSolR.scala
Questions? tiny.cloudera.com/app-arch-questions
Storage
High level architectureSource Transport Stream
ProcessingAccess
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
Why have batch processing?▪ When you need a larger context
- Say, to train a model▪ Complex periodic job that does something
- Convert data to a nested structure for reduced number of shuffles▪ In our use-case,
- Kudu -> HDFS Nested is batch processing- KMeans calculation is also in bash
Questions? tiny.cloudera.com/app-arch-questions
Batch processing options▪ Spark (+ MLlib)▪ MapReduce (+ Mahout)▪ Flink (+ Flink ML)
Questions? tiny.cloudera.com/app-arch-questions
Spark▪ Pretty popular▪ Much faster than MapReduce▪ Thriving community
Questions? tiny.cloudera.com/app-arch-questions
Flink▪ Pretty popular▪ Batch is a special case of Streaming▪ Developing community
Questions? tiny.cloudera.com/app-arch-questions
In our use-case▪ We chose Spark
- We were using Spark Streaming anyways- Similar code between Spark and Spark Streaming- Thriving community
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
Types of data access▪ REST server/APIs for querying entities and aggregates▪ UI for displaying search facets▪ SQL engine
Questions? tiny.cloudera.com/app-arch-questions
Why have REST server?▪ Tired of business people telling us how to access data▪ Serves as an interface between the data engineers and business folks▪ Lets business folks decide access patterns▪ Engineers to optimize those patterns▪ Brownie points from your boss▪ And, it’s not that difficult to write!
Questions? tiny.cloudera.com/app-arch-questions
Don’t believe me?import org.mortbay.jetty.Serverimport org.mortbay.jetty.servlet.{Context, ServletHolder}…
val server = new Server(port)val sh = new ServletHolder(classOf[ServletContainer])sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass", "com.sun.jersey.api.core.PackagesResourceConfig")sh.setInitParameter("com.sun.jersey.config.property.packages", "com.hadooparchitecturebook.taxi360.server.hbase")sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true”)val context = new Context(server, "/", Context.SESSIONS)context.addServlet(sh, "/*”)server.start()server.join()
Questions? tiny.cloudera.com/app-arch-questions
Then, write a ServiceLayer@GET@Path("vender/{venderId}/timeline")@Produces(Array(MediaType.APPLICATION_JSON))def getTripTimeLine (@PathParam("venderId") venderId:String,
@QueryParam("startTime") startTime:String = Long.MinValue.toString,@QueryParam("endTime") endTime:String = Long.MaxValue.toString):
Array[NyTaxiYellowTrip] = {
Questions? tiny.cloudera.com/app-arch-questions
Use REST! Say no to business people!▪ Access data like so:http://<serverURL>:8080/vender/{venderId}/timeline
Questions? tiny.cloudera.com/app-arch-questions
UI requirementsSomething that can ▪ Represent search results really well▪ Integrates with Apache Solr on Hadoop
Questions? tiny.cloudera.com/app-arch-questions
We choose Hue▪ Because it’s included▪ Please look at the others
Questions? tiny.cloudera.com/app-arch-questions
SQL engine criteria▪ Low latency SQL access▪ Allows for high concurrency▪ JDBC/ODBC integration▪ Capable of large scale aggregation▪ Optionally integrates with Kudu for real-time updates to SQL tables
Questions? tiny.cloudera.com/app-arch-questions
Apache Hive▪ Good JDBC integration▪ Not really low latency, even when using Tez▪ Doesn’t integrate with Kudu▪ Can run with MapReduce, Spark, or Tez
Questions? tiny.cloudera.com/app-arch-questions
Presto▪ Low latency SQL engine from Facebook▪ Provides JDBC/ODBC access▪ Is only in-memory, large aggregations can lead to OOM errors▪ Doesn’t integrate with Kudu
Questions? tiny.cloudera.com/app-arch-questions
Apache Impala▪ Low latency SQL access▪ Provides JDBC/ODBC access▪ Excellent concurrency support▪ Integrates with Kudu for real-time SQL
Questions? tiny.cloudera.com/app-arch-questions
Apache Drill▪ Similar in architecture to Impala▪ Provides JDBC/ODBC access▪ Doesn’t integrate with Kudu
Questions? tiny.cloudera.com/app-arch-questions
Spark SQL▪ Builds on top of Spark▪ JDBC/ODBC access only via Spark Thrift Server
- Doesn’t scale well with larger number of concurrent users- Doesn’t fully provide secure access.
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
Processing & Ingestion Engine
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT Rest
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
NestedTables
IndexedCube
RelationalTables
Entity Time Series Lookup
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
Storage
High level architectureSource Transport Stream
ProcessingAccess
Batch Processing
SQL
NRT REST
NRT Dashboard
Questions? tiny.cloudera.com/app-arch-questions
Access
High level architectureSource Transport Stream
ProcessingStorage
Questions? tiny.cloudera.com/app-arch-questions
High level architectureSource Transport Stream
ProcessingStorage Access
Questions? tiny.cloudera.com/app-arch-questions
High Level of the Demo Design
Producer
KafkaTopic Foo
Partition 1
Partition 2
Partition 3
Spark Streaming
Kudu
Spark Streaming
HBase
Spark Streaming
Solr
Spark Streaming
HDFS
Kudu
HBase
Solr
HDFS
SQL
REST
REST
Hue
SQL
Questions? tiny.cloudera.com/app-arch-questions
Other Sessions▪ Ask Us Anything session (Mark and Jonathan) – Thursday, 11:50 AM▪ Stream me up, Scotty: Transitioning to the cloud using a streaming data platform
(Gwen) – Wednesday, 2:40 PM▪ One cluster does not fit all: Architecture patterns for multicluster Apache Kafka
deployments (Gwen) – Thursday, 2:40 PM▪ Ask me Anything (Gwen) – Thursday, 4:20