Wisely Chen Spark Talk At Spark Gathering in Taiwan

Transcript of Wisely Chen Spark Talk At Spark Gathering in Taiwan

SparkSQL and Parquet

Wisely Chen

Data Tech Lead at Appier

Agenda

• Introducing myself and Appier

• How do we build our pipeline?

• Why do we use SparkSQL + HDFS?

• Why do we use Parquet?

Who am I?

• Data Team Lead at Appier

• Spark Code Contributor

• Personal Email: [email protected]

• Speaker at:

  • Spark Summit 2014 SF

  • Hadoop Summit 2013 San Jose

  • Jenkins Conf 2013 Palo Alto

What is Appier?

• AI and Data Company

• Our mission is to make advertising the preferred content that connects businesses and users

• Backed by Sequoia Capital

Data Team at Appier

• Deals with petabyte-scale data every day

• Runs a 2K to 3K core cluster on AWS

• Builds and maintains a robust data pipeline

• Data correctness is a must

• Parts of the pipeline need < 1 min latency

• Overall infrastructure cost must stay low

How do we do that?

Architecture

[Diagram components: Log, Kafka, Spark Streaming, ETL, S3, HDFS, Parquet, SparkSQL, ML]

Heavy Spark User

• ML: Custom Spark Application (no MLlib)

• ETL: Spark Application

• SQL: SparkSQL + Parquet

• Streaming: Spark Streaming + Kafka

Why Spark?

• We love Spark and are familiar with it

• Appier contributed >10 commits to Spark in the last quarter

• Perfect for ML applications

• A general-purpose engine that covers every kind of usage

• You don’t have to learn a lot of big data terms

Why is SQL important?

Before SparkSQL

5 engineers coding in Scala

After SparkSQL

All engineers can get involved in data projects

Data analysts can query data on their own

[Diagram components: User Interface (SQL + TimeRange), File Util, File List, SQL Engine, HDFS (Parquet), S3 (Parquet)]
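
The slide does not say how File Util works. One plausible reading is that it turns the time range attached to a SQL query into the list of Parquet paths the SQL engine should load. A minimal sketch under that assumption follows; the hourly directory layout, the `list_partitions` helper, and the `hdfs:///warehouse` root and `impressions` table are all hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta

# Hypothetical layout: one directory of Parquet files per table per hour.
PATH_TEMPLATE = "{root}/{table}/{day:%Y/%m/%d}/{hour:02d}"


def list_partitions(root, table, start, end):
    """Return hourly partition paths covering the half-open range [start, end)."""
    if end <= start:
        return []
    # Round the start down to the hour so a partial first hour is still read.
    cursor = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while cursor < end:
        paths.append(
            PATH_TEMPLATE.format(root=root, table=table, day=cursor, hour=cursor.hour)
        )
        cursor += timedelta(hours=1)
    return paths


# Example: a query over 22:30 on June 1 up to (not including) 01:00 on June 2.
paths = list_partitions(
    "hdfs:///warehouse",
    "impressions",
    datetime(2015, 6, 1, 22, 30),
    datetime(2015, 6, 2, 1, 0),
)
for path in paths:
    print(path)
```

Resolving the file list before the query runs means the SQL engine only opens files inside the requested window, and the same helper could point at S3 by changing the root.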

Why SparkSQL?

• We know Spark

• Knowledge of tuning Spark applications can be reused in SparkSQL

• Any table/UDF defined in a SparkSQL application can be reused in an ML application

• SparkSQL and DataFrame will become more important in the Spark ecosystem

Which storage is best for SparkSQL in Appier?

We tried Cassandra

• Pros

  • Easy to use and to build applications on

  • Easy to scale up

  • Hides all the heavy stuff inside the platform

• Cons

  • Not so easy to maintain

  • Not so easy to tune for performance

  • Hides all the heavy stuff inside the platform

We tried Aerospike

• Pros

  • Very good performance

  • Easy to maintain

  • Easy to scale

  • Hides all the heavy stuff inside the platform, but with a better implementation

• Cons

  • Expensive!!!!!!

HDFS + File

• Pros

  • Low cost

  • Good read and write performance on big data

  • HDFS is very stable

  • We know all the details

  • Easy to scale up

• Cons

  • We have to implement all the details ourselves

  • We have to write all the maintenance scripts ourselves

Why did we give up on Aerospike?

• Cost is too high

• We prefer to spend money on people rather than machines

Why did we give up on Cassandra?

• We are not familiar with Cassandra (main reason)

• Very easy to implement a POC

• Saves a lot of effort in the starting phase

• We feel it is hard to maintain in later phases (again: we are not familiar with Cassandra)

Why do we use HDFS/File?

• Cost is low

• Implementation takes a lot of time

• A solid engineering team is not afraid of this

• We can control every detail

• We can build a maintainable platform

The main reason

• We love Spark

• I had used HDFS before, but I have come to love it after this experience

Why Parquet?

What is Parquet?

• Based on the Google Dremel paper

• Columnar storage format

• Supports nested data structures (List, Map, ...)

• Supports Protobuf/Thrift/JSON

Column Storage

Column Pruning

Select Name From xxx

ID   Name      Age
1    Alice     23
2    Beverly   32
3    Cate      15

Predicate Pushdown

Select * From xxx where Age > 20

ID   Name      Age
1    Alice     23
2    Beverly   32
3    Cate      15
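
The two ideas on this slide, column pruning and predicate pushdown, can be illustrated with a toy in-memory columnar table. This is only a sketch of the concept, not how Parquet or SparkSQL implement it: the row-group layout, the `make_row_group`, `select_columns`, and `select_where_age_gt` helpers, and the split into two row groups are assumptions made for the example. The data is the three-row table from the slide.

```python
ROWS = [
    {"ID": 1, "Name": "Alice", "Age": 23},
    {"ID": 2, "Name": "Beverly", "Age": 32},
    {"ID": 3, "Name": "Cate", "Age": 15},
]


def make_row_group(rows):
    """Store each column as its own list and keep min/max statistics for Age."""
    columns = {name: [row[name] for row in rows] for name in ("ID", "Name", "Age")}
    stats = {"Age": (min(columns["Age"]), max(columns["Age"]))}
    return {"columns": columns, "stats": stats}


# Assumed split: rows 1-2 in one row group, row 3 in another.
ROW_GROUPS = [make_row_group(ROWS[:2]), make_row_group(ROWS[2:])]


def select_columns(row_groups, wanted):
    """Column pruning: read only the requested columns, never the others."""
    result = {name: [] for name in wanted}
    for group in row_groups:
        for name in wanted:
            result[name].extend(group["columns"][name])
    return result


def select_where_age_gt(row_groups, threshold):
    """Predicate pushdown: skip row groups whose max Age cannot match."""
    matches, groups_read = [], 0
    for group in row_groups:
        _, max_age = group["stats"]["Age"]
        if max_age <= threshold:
            continue  # No row in this group can satisfy Age > threshold.
        groups_read += 1
        columns = group["columns"]
        for i, age in enumerate(columns["Age"]):
            if age > threshold:
                matches.append({name: values[i] for name, values in columns.items()})
    return matches, groups_read


# Select Name From xxx
names = select_columns(ROW_GROUPS, ["Name"])
# Select * From xxx where Age > 20
rows, groups_read = select_where_age_gt(ROW_GROUPS, 20)
print(names)
print(rows, groups_read)
```

In the second query the row group holding only Cate is skipped from its statistics alone, so its column data is never scanned.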

Different Encodings

Encoding Algo         Use Case
Run Length Encoding   Repeated data
Delta Encoding        Ordered sequence data (timestamps, auto-generated IDs…)
Dictionary Encoding   Small set of distinct values (IP…)
Prefix Encoding       Delta encoding for strings
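
To make the four encodings in the table concrete, here is a simplified sketch of each idea in plain Python. These functions illustrate the concepts only; they are not Parquet's actual on-disk encodings, and the function names and sample values are made up for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original sequence with a running sum."""
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary, indices, positions = [], [], {}
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def prefix_encode(strings):
    """Store (shared_prefix_length, suffix) relative to the previous string."""
    encoded, previous = [], ""
    for s in strings:
        shared = 0
        while shared < min(len(s), len(previous)) and s[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, s[shared:]))
        previous = s
    return encoded


rle = run_length_encode(["US", "US", "US", "TW", "TW"])
deltas = delta_encode([1433116800, 1433116801, 1433116803])
dictionary, indices = dictionary_encode(["10.0.0.1", "10.0.0.2", "10.0.0.1"])
prefixes = prefix_encode(["spark", "sparksql", "sparrow"])
print(rle, deltas, dictionary, indices, prefixes)
```

Each encoding pays off on a different shape of column, which is why a columnar format can choose an encoding per column rather than one for the whole file.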

Storage

Language Independent

The real reason is

• SparkSQL treats Parquet/JSON as first-class citizens

• ORC and RCFile are not on their roadmap

• Parquet performs well in every aspect

Good lessons we learned

• Files (Parquet) are better storage than any DB

  • Easy to back up and replicate

  • Easy to change the storage solution

  • Easy to debug

  • Easy to maintain

Conclusion

• Spark Spark Spark

• SparkSQL + Parquet is a very good combination

• Don’t trust any solution / service. Don’t put any critical service on a platform you don’t trust

• A solid team can do anything you want

We are hiring