Post on 14-Dec-2015
Berkeley Data Analysis Stack: Shark, Bagel
Previous Presentation Summary: Mesos, Spark, Spark Streaming

The stack layers (infrastructure, storage, data processing, application):
- Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
- Data Management: efficient data sharing across frameworks
- Data Processing: in-memory processing; trade between time, quality, and cost
- Application: new apps such as AMP-Genomics, Carat, …
Spark Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                 // Cached RDD

cachedMsgs.filter(_.contains("foo")).count    // Parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to three workers; each worker reads one block of the file (Block 1-3), builds its cache partition (Cache 1-3), and returns results to the driver.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
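The same filter/map chain can be sketched on plain Scala collections, with no Spark cluster needed (the log lines below are made-up sample data, not from the slide):

```scala
object LogMiningSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for spark.textFile(...): a few tab-separated log lines.
    val lines = Seq(
      "ERROR\t12:01\tfoo failed",
      "INFO\t12:02\tall good",
      "ERROR\t12:03\tbar timed out"
    )
    // Same chain of transformations as the RDD version.
    val errors   = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split('\t')(2))
    // With RDDs, .cache() would pin `messages` in cluster memory;
    // a local Seq is already materialized, so there is nothing to cache.
    println(messages.count(_.contains("foo")))  // analogous to cachedMsgs.filter(...).count
    println(messages.count(_.contains("bar")))
  }
}
```

The point of the RDD version is that the expensive filter/map work happens once; every later `count` over `cachedMsgs` hits memory.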
Logistic Regression Performance
- Hadoop baseline: 127 s / iteration
- Spark: first iteration 174 s (includes loading the data), further iterations 6 s (data served from cache)
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  …
}
println("Final w: " + w)
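The loop body is elided on the slide. A minimal self-contained sketch of the kind of gradient step this example performs, using plain Scala collections instead of an RDD (the tiny dataset, the zero-initialized `w`, and the `DataPoint` class are all illustrative stand-ins, not the slide's actual code):

```scala
object LogisticRegressionSketch {
  case class DataPoint(x: Array[Double], y: Double) // y is +1 or -1

  def main(args: Array[String]): Unit = {
    val D = 2
    val ITERATIONS = 5
    // In-memory stand-in for spark.textFile(...).map(readPoint).cache()
    val data = Seq(
      DataPoint(Array(1.0, 2.0), 1.0),
      DataPoint(Array(-1.0, -1.5), -1.0)
    )
    var w = Array.fill(D)(0.0) // slide uses Vector.random(D)
    for (_ <- 1 to ITERATIONS) {
      // Gradient of the logistic loss, summed over all points.
      val gradient = data.map { p =>
        val dot   = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    println("Final w: " + w.mkString(", "))
  }
}
```

Because `data` stays in memory, each of the ITERATIONS passes only recomputes the gradient, which is exactly why later Spark iterations drop from 174 s to 6 s.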
HIVE: Components

- Interfaces: Hive CLI (DDL, queries, browsing), Thrift API, Mgmt. Web UI
- Compiler/runtime: Hive QL parser, planner, execution
- MetaStore
- SerDe: Thrift, Jute, JSON, …
- Runs on: Map Reduce over HDFS
Data Model

Hive Entity        Sample Metastore Entity    Sample HDFS Location
Table              T                          /wh/T
Partition          date=d1                    /wh/T/date=d1
Bucketing column   userid                     /wh/T/date=d1/part-0000 …
                                              /wh/T/date=d1/part-1000 (hashed on userid)
External Table     extT                       /wh2/existing/dir (arbitrary location)
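The "hashed on userid" note means a row's bucket file is chosen by hashing the bucketing column modulo the number of buckets. A toy sketch of that mapping in plain Scala (the hash function and bucket count here are illustrative assumptions, not Hive's exact implementation):

```scala
object BucketSketch {
  // Hypothetical stand-in for Hive's bucketing rule:
  // bucket = hash(column value) mod numBuckets.
  def bucketFor(userid: Long, numBuckets: Int): Int = {
    val h = userid.hashCode()
    ((h % numBuckets) + numBuckets) % numBuckets // keep the result non-negative
  }

  def main(args: Array[String]): Unit = {
    val numBuckets = 32
    val b = bucketFor(12345L, numBuckets)
    // All rows with this userid land in the same part file of the partition:
    println(f"/wh/T/date=d1/part-$b%04d")
  }
}
```

Co-locating equal keys this way is what lets Hive sample a table by bucket and join two tables bucketed on the same column without a full shuffle.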
Hive/Shark flowchart (Insert into table)

Two ways to do this:
1. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
2. Load "buckets" directly: the user is responsible for creating the buckets.

First, create the target table (this creates the table directory):

CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Step 1: create the staging external table over the raw data:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

Step 2: copy the raw file into the staging location:

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

Step 3: query the staging table and write into the target partition:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
[Diagram: Hive execution data flow. A file on HDFS is read through the FileFormat / Hadoop serialization layer as a stream of Writables; a SerDe deserializes each Writable into a hierarchical object, which Hive operators process inside the Mapper. The map output file carries Writables to the Reducer, where a SerDe rebuilds hierarchical objects for further Hive operators or a user script, and results are serialized back to a file on HDFS. ObjectInspectors give the operators uniform access to the hierarchical objects.]
A row can be represented at several levels:

- Text rows, e.g. '1.0 3 54', '0.2 1 33', '2.2 8 212', '0.7 2 22'
- thrift_record<…> per row
- Writable, e.g. BytesWritable(\x3F\x64\x72\x00) or Text('1.0 3 54') // UTF8 encoded
- Java Object: object of a Java class
- Standard Object: use ArrayList for struct and array, use HashMap for map
- LazyObject: lazily-deserialized

User-defined SerDes operate per ROW.
SerDe, ObjectInspector and TypeInfo

The SerDe deserializes a Writable into a hierarchical object and hands out an ObjectInspector (deserialize / getOI). ObjectInspectors then navigate the object against its TypeInfo: getType, getStructField / getFieldOI for struct members, getMapValue / getMapValueOI for map entries, down to primitive inspectors for strings and ints.

Example type (as TypeInfo): struct { map<string, string>, int, list<struct{int, int}>, string }

In Java terms:

class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; }
class ClassC { Integer a; Integer b; }

Raw Writables:

BytesWritable(\x3F\x64\x72\x00)
Text('a=av:b=bv 23 1:2=4:5 abcd')

Deserialized standard object:

List( HashMap("a" -> "av", "b" -> "bv"),
      23,
      List(List(1, null), List(2, 4), List(5, null)),
      "abcd" )
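A toy sketch of the delimiter-based deserialization behind Text('a=av:b=bv 23 1:2=4:5 abcd'). The delimiters here are assumptions chosen to match this one example, not Hive's actual defaults: space between top-level fields, ':' between collection items, '=' inside map entries and structs; a missing struct member becomes None, standing in for null:

```scala
object LazyDeserializeSketch {
  def main(args: Array[String]): Unit = {
    val row = "a=av:b=bv 23 1:2=4:5 abcd"
    val fields = row.split(' ') // top-level struct fields

    // Field 0: map<string, string>
    val m = fields(0).split(':').map { kv =>
      val Array(k, v) = kv.split('='); k -> v
    }.toMap

    // Field 1: int
    val b = fields(1).toInt

    // Field 2: list<struct{int, int}>; a missing second member is None (null)
    val c = fields(2).split(':').map { item =>
      item.split('=') match {
        case Array(x)    => (x.toInt, None)
        case Array(x, y) => (x.toInt, Some(y.toInt))
      }
    }.toList

    // Field 3: string
    val d = fields(3)

    println((m, b, c, d))
  }
}
```

The "lazy" part of a real LazyObject is that these splits and conversions happen only when an ObjectInspector actually asks for a field, so untouched columns are never parsed.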