HPBigData2015: PSTL (Kafka, Spark, Vertica)



8/7/15
Jack Gudenkauf, VP Big Data

scala> sc.parallelize(List("Kafka Spark Vertica"), 3).mapPartitions(iter => { iter.toList.map(x => print(x)) }.iterator).collect; println()

https://twitter.com/_JG


Agenda
1. Background
2. PSTL overview (Parallelized Streaming Transformation Loader)
3. Parallelism in Kafka, Spark, Vertica
4. PSTL drill down
5. Vertica Performance!

Agenda
1. Background

Playtika
- Founded in 2010
- Social Casino global category leader
- 10 games, 13 platforms, 1000+ employees

Placing Your Trifecta Box Bet on Kafka, Spark, and HP Vertica

A trifecta requires you to select the first three finishers in order, and can lead to big pay-offs. Boxing lets your selections come in any order and you still win. Kafka - Spark - Vertica.

My Background (https://www.linkedin.com/in/jackg)

- Playtika, VP of Big Data: Flume, Kafka, Spark, ZooKeeper, YARN, Vertica [Jay Kreps (Kafka), Michael Armbrust (Spark SQL), Chris Bowden (Dev Demigod)]
- MIS Director of several start-up companies: DataFlex, a 4GL RDBMS [E.F. Codd]
- Self-employed consultant: intercepted DataFlex db calls to store and retrieve data to/from Btrieve and an IBM DB2 mainframe; FoxPro, Sybase, MS SQL Server beta; Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]
- Microsoft: Dev Manager, Architect on the CLR/.NET Framework, Product Unit Manager in the Technical Strategy Group; inventor of Shuttle, a Microsoft product in use since 1999, a distributed ETL based on MSMQ that influenced MS SQL DTS (SQL SSIS) [Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]
- Twitter, Manager of Analytics Data Warehouse: Core Storage (Hadoop, HBase, Cassandra, Blob Store); Analytics Infra (MySQL, Pig, Vertica with n-petabyte capacity and multi-DC DR) [Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)]

My experience and influencers framed my architectural decisions: DataFlex was general purpose (not a CICS, COBOL, DB2 mix); velocity, variety, and volume, at scale.

A Quest, with attributes of:
- Operational Robustness
  - High Availability
  - Stronger durability guarantees
  - Idempotent (an operation that is safe to repeat; see the sketch after this slide)
- Productivity
  - Analytics: Streaming, Machine Learning, BI, BA, Data Science
  - Rich development environment: strongly typed, OO, functional, with support for set-based logic and aggregations (SQL)
- Performance
  - Scalable in every tier
  - MPP for transformations, reads, and writes

A Unified Data Pipeline, with parallelism from streaming data, through data transformations, to data storage (semi-structured, structured, and relational data).
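Idempotence in practice: a minimal sketch, assuming (hypothetically) that each batch's output path is derived from its Kafka coordinates, so replaying a batch overwrites the same file instead of appending a duplicate. The drill-down later writes raw and parsed JSON to HDFS this way; plain local files are used here only to keep the sketch self-contained.

import java.nio.file.{Files, Paths, StandardOpenOption}
import scala.collection.JavaConverters._

// The path is a pure function of the Kafka coordinates, so re-running the
// same batch rewrites the same file: the write is safe to repeat.
def writeBatch(records: Seq[String], topic: String, partition: Int,
               fromOffset: Long, untilOffset: Long): Unit = {
  val path = Paths.get(s"/data/raw/$topic/$partition/$fromOffset-$untilOffset.json")
  Files.createDirectories(path.getParent)
  Files.write(path, records.asJava,
    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING) // overwrite, never append
}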

Experience with data at scale influenced my architectural decisions. Productivity: Spark SQL & Scala in the REPL, stand-alone or on YARN on a cluster. See http://spark.apache.org/docs/0.9.1/scala-programming-guide.html#transformations: map(func), filter(func), mapPartitionsWithIndex(func), groupByKey([numTasks]).
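For instance, in the Spark shell (toy data; outputs shown for illustration):

scala> val rdd = sc.parallelize(1 to 10, 2)
scala> rdd.map(_ * 2).filter(_ % 4 == 0).collect()
res0: Array[Int] = Array(4, 8, 12, 16, 20)
scala> sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).groupByKey().mapValues(_.sum).collect()
res1: Array[(String, Int)] = Array((a,4), (b,2))  // ordering may vary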


Original Architecture (ETL): Playtika Santa Monica, Extract Transform Load

1. Game Applications, each with its own ID types:
   - Bingo Blitz: UserId INT, SessionId UUID (36)
   - Slotomania: UserId Int, SessionId UUID (32)
   - WSOP: UserId varchar(32), SessionId varchar(255)
2. REST API -> Apache Flume (each game emits JSON in a unified schema)
3. ETL: a Java parser & loader (local data warehouses as single sources of truth feeding a global SOT)
4. COPY into the MPP columnar DW: HP Vertica cluster (UserId -> UserGId)
5. Analytics of relational data (structured relational and aggregated data)

When I came to Playtika, the architecture looked like most ETLs. The good news: they chose well with Vertica (though no trifecta). The bad news: the ETL did not scale, was not highly available, and they made all the mistakes every Vertica RDBMS customer makes (including Twitter): Flume file/event sequencing issues, local writes backing up, and an ETL that was a single app with GC pressure, bringing the data to the processing.

Agenda
1. Background
2. PSTL overview (Parallelized Streaming Transformation Loader)

New PSTL Architecture: Parallelized Streaming Transformation Loader (PSTL is the new ETL)

1. Game Applications (Bingo Blitz: UserId INT, SessionId UUID (36); Slotomania: UserId Int, SessionId UUID (32); WSOP: UserId varchar(32), SessionId varchar(255)) -> REST API or local Kafka (unified schema, JSON; local data warehouses)
2. Real-time messaging: Apache Kafka cluster
3. ETL: Resilient Distributed Datasets on an Apache Spark + Hadoop Parquet cluster; real-time streaming and machine learning; semi-structured raw JSON data; structured (non)relational Parquet data
4. MPP columnar DW: HP Vertica cluster (structured relational and aggregated data)
5. Analytics of [semi]structured [non]relational data stores

The new Spark parallelism model and the Lambda Architecture speak to the streaming ... presentation layer. I'm a betting man, and I guarantee you haven't seen or heard of the new PSTL pipeline; you heard it here first! Vertica is MPP, but mostly when it comes to reads; the real power comes when that parallelism is coupled with Spark RDDs for writes.

Agenda
1. Background
2. PSTL overview (Parallelized Streaming Transformation Loader)
3. Parallelism in Kafka, Spark, Vertica

Apache Kafka is a distributed, partitioned, replicated commit log service.

Producers -> Kafka cluster (brokers) -> Consumers

http://kafka.apache.org/documentation.html (authored at LinkedIn; used by Twitter, ...)
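For orientation, a minimal producer sketch with the standard Kafka client (the broker address and the keying choice are placeholder assumptions; in PSTL the game applications produce via the REST API or local Kafka):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // placeholder broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
// keying by userId keeps one user's events together in a single partition
producer.send(new ProducerRecord("appId_1", "JG", """{"appId": 1, "sessionId": 1, "userId": "JG"}"""))
producer.close()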

A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log. The messages in the partitions are each assigned a sequential id number called the offset, which uniquely identifies each message within the partition. The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.
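Those offsets are what let PSTL read deterministically from Spark (step 4 of the drill-down) instead of tailing a stream. A sketch with the Spark 1.x Kafka integration; the broker, topic, and offsets here are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// Pull an explicit, replayable slice of each partition into an RDD.
// One Kafka partition becomes one Spark partition, preserving parallelism.
def readSlice(sc: SparkContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
  val offsetRanges = Array(
    OffsetRange("appId_1", 0, fromOffset = 0L, untilOffset = 1000L),
    OffsetRange("appId_1", 1, fromOffset = 0L, untilOffset = 1000L))
  KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, offsetRanges) // RDD of (key, message) pairs
}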

Apache Kafka is not a message queue (push/pop).


Spark RDD: a Resilient Distributed Dataset [in memory] represents an immutable, partitioned collection of elements that can be operated on in parallel.

[Diagram: Nodes 1 through N each hold partitions of several RDDs, e.g. RDD 1 partitions 1-3 and RDD 3 partitions 1-3 spread across Nodes 1-3, and RDD 2 partitions 1 to 64, 65 to 128, 129 to 192, and 193 to 256 spread across the cluster.]

We use Spark RDD partitioned data to parallelize operations to/from affinitized Vertica nodes; e.g., 3 Kafka partitions are read in parallel into 3 Spark RDD partitions. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
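For example, in the REPL (a Range input splits evenly across the requested partitions):

scala> val rdd = sc.parallelize(1 to 12, 3)  // force 3 partitions, as with 3 Kafka partitions
scala> rdd.partitions.size
res0: Int = 3
scala> rdd.mapPartitionsWithIndex((i, it) => Iterator(s"partition $i -> ${it.mkString(",")}")).collect().foreach(println)
partition 0 -> 1,2,3,4
partition 1 -> 5,6,7,8
partition 2 -> 9,10,11,12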


Vertica Hashing & Partitioning

With a standard COPY, an initiator node receives the incoming data and shuffles it out to the storage nodes. SHUFFLE! That single-initiator shuffle is exactly what PSTL sidesteps by hashing a priori in Spark RDD partitions (see the performance takeaways at the end).

Agenda
1. Background
2. PSTL overview (Parallelized Streaming Transformation Loader)
3. Parallelism in Kafka, Spark, Vertica
4. PSTL drill down

{"appId": 3, "sessionId": 7, "userId": 42 }{"appId": 3, "sessionId": 6, "userId": 42 }Node 1Node 2Node 3Node 4

3 Import recent Sessions

Apache Kafka ClusterTopic: appId_1 Topic: appId_2 Topic: appId_3

oldnewMySQL Kafka TableappId,TopicOffsetRange, Batch_IdSessionMax TablesessionGIdMax IntUserMax TableuserGIdMax IntappSessionMap_RDDappId: IntsessionId: StringsessionGId: IntappUserMap_RDDappId: IntuserId: StringuserGId: IntappSessionappId: IntsessionId: varchar(255)sessionGId: IntappUserappId: IntuserId: varchar(255)userGId: Int1 Start a Spark Driver per APPNode 1Node 2Node 34 Spark Kafka [non]Streaming job per APP (read partition/offset range) 5 select for update; update max GId5 Assign userGIds To userIdsessionGIds To sessionId

6 Hash(userGId) to RDD partitions with affinityTo Vertica Node (Parallelism)7 userGIdRDD.foreachPartition {stream.writeTo(socket)...}

8 Idempotent: Write Raw Unparsed JSON to hdfs

9 Idempotent: Write Parsed JSON to .parquet hdfs10 Update MySQLKafka Offsets{"appId": 2, "sessionId": 4, "userId": KA }{"appId": 2, "sessionId": 3, "userId": KY }{"appId": 1, "sessionId": 2, "userId": CB }{"appId": 1, "sessionId": "1, "userId": JG }4 appId {Game events, Users, Sessions,} Partition 1..n RDDs5 appId Users & Sessions Partition 1..n RDDs5 appId appUserMap_RDD.union(assignedID_RDD) RDDs6 appId Users & Sessions Partition 1..n RDDs7 copy jackg.DIM_USER with source SPARK(port='12345, nodes=node0001:4, node0002:4, node0003:4) direct;2 Import UsersApache Hadoop Spark Cluster

HP Vertica Cluster
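Steps 6 and 7 are the heart of the parallel load. A minimal sketch of the idea, not PSTL's actual code (the node list, port, and row format are placeholder assumptions): hash each row's userGId to a partition pinned to one Vertica node, then let every partition stream its rows straight to the copy source listening on that node.

import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, TaskContext}

// Rows are keyed by userGId; HashPartitioner routes hash(userGId) % n to
// partition i, and partition i is pinned to verticaNodes(i): node affinity.
def parallelLoad(rows: RDD[(Int, String)]): Unit = {
  val verticaNodes = Array("node0001", "node0002", "node0003") // placeholder node list
  rows
    .partitionBy(new HashPartitioner(verticaNodes.length)) // step 6: a priori hash
    .foreachPartition { iter =>                            // step 7: stream to the node
      val node = verticaNodes(TaskContext.get.partitionId % verticaNodes.length)
      val socket = new Socket(node, 12345) // Vertica's copy source listens here
      val out = new PrintWriter(socket.getOutputStream)
      iter.foreach { case (_, row) => out.println(row) }
      out.flush()
      socket.close()
    }
}

On the Vertica side, the single COPY statement in step 7 pulls from all of those sockets at once, so every node loads its own pre-hashed slice in parallel instead of funneling everything through one initiator.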


Agenda
1. Background
2. PSTL overview (Parallelized Streaming Transformation Loader)
3. Parallelism in Kafka, Spark, Vertica
4. PSTL drill down
5. Vertica Performance!

Impressive Parallel COPY Performance

Loaded 2.42 billion rows (451 GB) in 7 min 35 sec on an 8-node cluster. That is roughly 3.6 TB/hour on 8 nodes, which extrapolates to about 36 TB/hour on an 81-node cluster.

Key Takeaways:
- Parallel Kafka reads into Spark RDDs (in memory), with parallel writes to Vertica via a TCP server, ROCKS!
- COPY ~36 TB/hour with an 81-node cluster
- No ephemeral nodes needed for ingest
- Kafka read parallelism carries through to Spark RDD partitions
- A priori hash() in Spark RDD partitions (in memory)
- A TCP server as a Vertica User-Defined Copy Source
- A single COPY does not preallocate memory across nodes

http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sl