Berkeley Data Analysis Stack: Shark, Bagel

Transcript of a presentation on the Berkeley Data Analysis Stack (Shark, Bagel), following up on an earlier presentation covering Mesos, Spark, and Spark Streaming.

Page 1: Berkeley Data Analysis Stack: Shark, Bagel

Page 2: Previous Presentation Summary

Previous Presentation Summary
• Mesos, Spark, Spark Streaming

The stack layers and their roles:
• Infrastructure (Resource Management): share infrastructure across frameworks (multi-programming for datacenters)
• Storage (Data Management): efficient data sharing across frameworks
• Data Processing: in-memory processing; trade between time, quality, and cost
• Application: new apps such as AMP-Genomics, Carat, …

Page 3: Previous Presentation Summary

Previous Presentation Summary
• Mesos, Spark, Spark Streaming

Page 4: Spark Example: Log Mining

Spark Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")           // Base RDD
    errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()                  // Cached RDD

    cachedMsgs.filter(_.contains("foo")).count     // Parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Diagram: the Driver dispatches tasks to Workers; each Worker reads one block of the file (Block 1, 2, 3), materializes its cache partition (Cache 1, 2, 3), and returns results to the Driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
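The same pipeline can be sketched in plain single-process Python, with a materialized list standing in for Spark's distributed cache (the sample log lines below are hypothetical, for illustration only):

```python
# Minimal, single-process emulation of the Spark log-mining example.
# Spark's lazy, distributed RDDs are stood in for by Python generators/lists.

log_lines = [                      # hypothetical sample data
    "INFO\t12:00\tstartup",
    "ERROR\t12:01\tfoo failed",
    "ERROR\t12:02\tbar timeout",
]

# filter(_.startsWith("ERROR")) -- lazy, like an RDD transformation
errors = (line for line in log_lines if line.startswith("ERROR"))

# map(_.split('\t')(2)) -- keep only the message field
messages = (line.split("\t")[2] for line in errors)

# cache() -- materialize once so repeated queries reuse the result
cached_msgs = list(messages)

# filter(_.contains("foo")).count -- actions over the cached data
foo_count = sum(1 for m in cached_msgs if "foo" in m)
bar_count = sum(1 for m in cached_msgs if "bar" in m)
print(foo_count, bar_count)  # -> 1 1
```

As in the slide, the expensive work (reading and parsing) happens once; the repeated "foo"/"bar" queries only scan the cached messages.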

Page 5: Logistic Regression Performance

Logistic Regression Performance
• 127 s / iteration
• first iteration 174 s
• further iterations 6 s

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      …
    }
    println("Final w: " + w)
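The elided loop body computes a gradient over the cached points and updates w, which is why only the first iteration pays the load-and-parse cost. A minimal Python sketch of that iterative structure (the sample data, learning rate, and gradient formula here are illustrative, not taken from the slide):

```python
import math
import random

# Illustrative stand-in for the cached RDD of (features, label) points.
random.seed(0)
D = 2
data = [([random.gauss(0, 1), random.gauss(0, 1)], random.choice([-1.0, 1.0]))
        for _ in range(100)]

w = [random.gauss(0, 1) for _ in range(D)]
ITERATIONS = 10

for _ in range(ITERATIONS):
    # Gradient of the logistic loss summed over the cached points;
    # in Spark this would be a data.map(...).reduce(_ + _) per iteration.
    grad = [0.0] * D
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coef = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        for j in range(D):
            grad[j] += coef * x[j]
    # Gradient-descent step (0.01 is an illustrative learning rate).
    w = [wi - 0.01 * g for wi, g in zip(w, grad)]

print("Final w:", w)
```

Each iteration rescans the same points, so keeping `data` in memory (the `cache()` in the Scala version) is what turns 174 s iterations into 6 s ones.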

Page 6: HIVE: Components

HIVE: Components

[Diagram: clients (the Hive CLI for DDL, Queries, and Browsing; a Mgmt. Web UI; and a Thrift API) talk to the Hive QL layer, which consists of a Parser, a Planner, and an Execution engine; a MetaStore holds table metadata; a SerDe layer (Thrift, Jute, JSON, …) handles (de)serialization; queries execute as Map Reduce jobs over HDFS.]

Page 7: Data Model

Data Model

Hive Entity        Sample Metastore Entity   Sample HDFS Location
Table              T                         /wh/T
Partition          date=d1                   /wh/T/date=d1
Bucketing column   userid                    /wh/T/date=d1/part-0000 … /wh/T/date=d1/part-1000 (hashed on userid)
External Table     extT                      /wh2/existing/dir (arbitrary location)
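The bucketing row above can be sketched numerically: a row lands in the bucket file whose index is the hash of the bucketing column modulo the bucket count. The sketch below uses Python's built-in `hash` and an illustrative bucket count; Hive's actual hash function and file naming differ in detail:

```python
NUM_BUCKETS = 32  # illustrative; Hive uses the CLUSTERED BY ... INTO n BUCKETS count

def bucket_path(table, partition, userid, num_buckets=NUM_BUCKETS):
    """Map a row to its partition directory and bucket file under /wh.

    Python's hash() stands in for Hive's own column hash function.
    """
    bucket = hash(userid) % num_buckets
    return f"/wh/{table}/{partition}/part-{bucket:04d}"

p = bucket_path("T", "date=d1", userid=12345)
print(p)  # -> /wh/T/date=d1/part-0025 (CPython hashes small ints to themselves)
```

All rows with the same userid hash to the same bucket file, which is what makes bucketed joins and sampling cheap.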

Page 8: Hive/Shark flowchart (Insert into table)

Hive/Shark flowchart (Insert into table)
Two ways to do this:
1. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
2. Load "buckets" directly: the user is responsible for creating the buckets.

    CREATE TABLE page_view(viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;

This creates the table directory.

Page 9: Hive/Shark flowchart (Insert into table)

Hive/Shark flowchart (Insert into table)
1. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.

Step 1: create the staging external table.

    CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User',
        country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
    STORED AS TEXTFILE
    LOCATION '/user/data/staging/page_view';

Step 2: copy the raw data file into the staging location.

    hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

Page 10: Hive/Shark flowchart (Insert into table)

Hive/Shark flowchart (Insert into table)
1. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.

Step 3: insert from the staging table into the partitioned table.

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
           null, null, pvs.ip
    WHERE pvs.country = 'US';

Page 11: Hive

Hive

[Diagram: data flow through a Hive Map Reduce job. A file on HDFS is read through the FileFormat / Hadoop serialization layer as Writables; a SerDe deserializes each Writable into a hierarchical object, whose structure is described by an ObjectInspector; Hive operators in the Mapper process the hierarchical objects, optionally streaming them through a user script; results are serialized back to Writables in the map output file, deserialized again in the Reducer, processed by Hive operators, and finally written back to a file on HDFS.]

Example representations of the same rows:
• Text('1.0 3 54'), Text('0.2 1 33'), Text('2.2 8 212'), Text('0.7 2 22')  // UTF8-encoded text rows
• thrift_record<…>  // Thrift-encoded records
• BytesWritable(\x3F\x64\x72\x00)
• Java Object: an object of a Java class
• Standard Object: use ArrayList for struct and array, HashMap for map
• LazyObject: lazily-deserialized

User-defined SerDes operate per ROW.
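A per-row SerDe is essentially a pair of functions between the on-wire text row and the in-memory hierarchical object. A minimal Python sketch (the space-delimited format and the field names/types are illustrative, not Hive's actual SerDe API):

```python
# Illustrative per-row SerDe for space-delimited rows like "1.0 3 54":
# deserialize: text row -> hierarchical object; serialize: object -> text row.

FIELDS = [("score", float), ("count", int), ("total", int)]  # hypothetical schema

def deserialize(row: str) -> dict:
    """Text('1.0 3 54') -> {'score': 1.0, 'count': 3, 'total': 54}"""
    values = row.split(" ")
    return {name: typ(v) for (name, typ), v in zip(FIELDS, values)}

def serialize(obj: dict) -> str:
    """Hierarchical object -> delimited text row."""
    return " ".join(str(obj[name]) for name, _ in FIELDS)

obj = deserialize("1.0 3 54")
print(obj)                         # -> {'score': 1.0, 'count': 3, 'total': 54}
print(serialize(obj))              # -> 1.0 3 54
```

Because the SerDe is the only component that understands the byte layout, swapping delimited text for Thrift or JSON changes these two functions but leaves the operators untouched.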

Page 12: SerDe, ObjectInspector and TypeInfo

SerDe, ObjectInspector and TypeInfo

[Diagram: the SerDe's deserialize() turns a Writable (e.g. BytesWritable(\x3F\x64\x72\x00) or Text('a=av:b=bv 23 1:2=4:5 abcd')) into a hierarchical object, and getOI() returns the ObjectInspector describing it. Inspector chains navigate the structure: ObjectInspector1.getFieldOI / getStructField for a struct field, ObjectInspector2.getMapValueOI / getMapValue for a map value, down to ObjectInspector3 for a string object; getType on each inspector yields its TypeInfo. The row's type tree here is struct{ map<string,string>, int, list<struct{int,int}>, string }.]

The same hierarchical object can be a Java object:

    class HO {
      HashMap<String, String> a;
      Integer b;
      List<ClassC> c;
      String d;
    }
    class ClassC {
      Integer a;
      Integer b;
    }

or a standard object:

    List( HashMap("a" "av", "b" "bv"),
          23,
          List(List(1, null), List(2, 4), List(5, null)),
          "abcd" )
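The ObjectInspector idea, accessing differently-encoded rows through one interface, can be sketched in Python. The inspector classes below are illustrative, not Hive's actual API; they only mirror the getStructField navigation shown on the slide:

```python
# Illustrative ObjectInspector pattern: the same logical struct can be stored
# as a dict (standard object) or as a raw delimited string (lazy object);
# an inspector hides the physical representation from the operator code.

class StandardStructInspector:
    """Fields are already materialized in a dict."""
    def get_struct_field(self, obj, name):
        return obj[name]

class LazyStructInspector:
    """Fields live in a ':'-delimited string and are deserialized on access."""
    def __init__(self, field_names):
        self.field_names = field_names
    def get_struct_field(self, obj, name):
        return obj.split(":")[self.field_names.index(name)]

def get_b(obj, inspector):
    # Operator code: representation-agnostic field access via the inspector.
    return inspector.get_struct_field(obj, "b")

standard = {"a": "av", "b": "23"}   # standard object
lazy = "av:23"                      # lazily-deserialized object
print(get_b(standard, StandardStructInspector()))    # prints 23
print(get_b(lazy, LazyStructInspector(["a", "b"])))  # prints 23
```

This is why Hive operators never touch Writables directly: they see only (object, inspector) pairs, so lazy and eager representations are interchangeable.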