Using Apache Spark and MySQL for Data Analysis

Alexander Rubin, Sveta Smirnova, Percona, February 4, 2017

Transcript of Using Apache Spark and MySQL for Data Analysis

Page 1: Using Apache Spark and MySQL for Data Analysis


Alexander Rubin, Sveta Smirnova, Percona, February 4, 2017

Page 2: Using Apache Spark and MySQL for Data Analysis

www.percona.com

Agenda

• Why Spark?
• Spark Examples
– Wikistats analysis with Spark

Page 3: Using Apache Spark and MySQL for Data Analysis


What is Spark anyway?

[Diagram: MySQL provides data, SQL, and a protocol to the app; Spark is a parallel compute-only layer running across nodes, reading from a local FS]

Page 4: Using Apache Spark and MySQL for Data Analysis

Why Spark?

• In-memory processing with caching
• Massively parallel
• Direct access to data sources (e.g. MySQL):

>>> df = sqlContext.load(source="jdbc",
        url="jdbc:mysql://localhost?user=root",
        dbtable="ontime.ontime_sm")

• Can store data in Hadoop HDFS / S3 / the local filesystem
• Native Python and R integration

Page 5: Using Apache Spark and MySQL for Data Analysis


Spark vs MySQL

Page 6: Using Apache Spark and MySQL for Data Analysis


Spark vs. MySQL for Big Data

MySQL: Indexes / Partitioning / "Sharding"
Spark: Full table scan / Partitioning / Map/Reduce

Page 7: Using Apache Spark and MySQL for Data Analysis


Spark (vs. MySQL)

• No indexes
• All processing is a full scan
• BUT: distributed and parallel
• No transactions
• High latency (usually)

MySQL: 1 query = 1 CPU core

Page 8: Using Apache Spark and MySQL for Data Analysis


Indexes (B-Tree): the Big Data challenge

• Creating an index for petabytes of data?
• Updating an index for petabytes of data?
• Reading a terabyte of index?
• Random reads across a petabyte?

A full scan in parallel is better for big data
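The idea can be sketched on a single machine: a toy dataset scanned in parallel chunks, with no index ever built. This is only illustrative (Spark distributes chunks across nodes, not threads); all names and data here are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "table": 100k rows of (id, value); no index is ever built.
rows = [(i, i % 100) for i in range(100_000)]

def scan_chunk(chunk):
    # Each worker full-scans its own chunk independently.
    return sum(v for _, v in chunk if v > 90)

def parallel_full_scan(rows, workers=4):
    # Split the data into chunks and scan them concurrently,
    # then combine the partial results (a tiny map/reduce).
    size = (len(rows) + workers - 1) // workers
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(scan_chunk, chunks))

total = parallel_full_scan(rows)
```

Because each chunk is independent, adding workers (or, in Spark, nodes) shortens the scan without any index maintenance cost.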

Page 9: Using Apache Spark and MySQL for Data Analysis


ETL / Pipeline

MySQL:
1. Extract data from the external source
2. Transform before loading
3. Load data into MySQL

Spark:
1. Extract data from the external source
2. Load data or rsync to all Spark nodes
3. Transform / analyze / visualize the data

Parallelism

Page 10: Using Apache Spark and MySQL for Data Analysis


Schema on Write vs. Schema on Read

Schema on Write (MySQL):
• LOAD DATA INFILE will verify (validate) the input
• … with implicit data conversion
• … or fail if the number of columns is wrong

Schema on Read (Spark):
• No "load data" per se, so there is nothing to validate at load time
• Create an external table, or read the CSV directly
• Validation happens on "read" / SELECT
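The schema-on-read side can be sketched in plain Python (a stand-in for Spark's CSV reader; the invented sample data mirrors the four-column wikistats layout used later in the deck). Nothing is validated when the raw text is stored; a malformed row only surfaces when you read.

```python
# Raw lines are "loaded" by simply keeping them as text -- nothing is
# validated yet (schema on read). Layout: project title requests size.
raw = ("en Main_Page 1000 5000\n"
       "en Broken_Row notanumber\n"   # malformed row: stored without complaint
       "fr Accueil 500 2500\n")

def read_rows(text):
    # Types are applied -- and rows validated -- only at query time.
    for line in text.splitlines():
        parts = line.split(" ")
        try:
            yield {"project": parts[0], "title": parts[1],
                   "num_requests": int(parts[2]), "content_size": int(parts[3])}
        except (IndexError, ValueError):
            continue  # the bad row is discovered on read, not on load

good = list(read_rows(raw))
```

With schema on write, the second line would have made the load fail; here the cost is deferred to every SELECT.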

Page 11: Using Apache Spark and MySQL for Data Analysis


Example: Loading wikistat into MySQL

1. Extract data from external source and uncompress!

2. Load data into MySQL and Transform

Wikipedia page counts – download, >10TB

load data local infile '$file'
into table wikistats.wikistats_full
CHARACTER SET latin1
FIELDS TERMINATED BY ' '
(project_name, title, num_requests, content_size)
set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
    title_md5 = unhex(md5(title));

http://dumps.wikimedia.org/other/pagecounts-raw/
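For sanity-checking the two SET transforms above, here are straightforward Python equivalents (a sketch, not part of the original deck; note that MySQL's `%i` date token means minutes and corresponds to `%M` in Python's strptime):

```python
import hashlib
from datetime import datetime

def title_md5(title: str) -> bytes:
    # MySQL: unhex(md5(title)) -> the 16 raw digest bytes
    return hashlib.md5(title.encode("latin-1")).digest()

def request_date(datestr: str) -> datetime:
    # MySQL: STR_TO_DATE('$datestr', '%Y%m%d %H%i%S')
    # (%i is minutes in MySQL; Python's strptime uses %M)
    return datetime.strptime(datestr, "%Y%m%d %H%M%S")
```

Storing the digest as 16 raw bytes (rather than 32 hex characters) halves the index size, which is why the slide uses unhex().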

Page 12: Using Apache Spark and MySQL for Data Analysis


Load timing per hour of wikistat

• InnoDB: 52.34 sec
• MyISAM: 11.08 sec (+ indexes)
• 1 hour of wikistats = ~1 minute of load time
• 1 year will load in ~6 days (8765.81 hours in 1 year)
• 6 years => more than 1 month to load

Not even counting the insert time degradation…
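The slide's arithmetic checks out; a quick back-of-the-envelope using the hours-per-year figure quoted above:

```python
HOURS_PER_YEAR = 8765.81      # figure quoted on the slide

# ~1 minute of load time per 1 hour of wikistats data:
minutes_to_load_one_year = HOURS_PER_YEAR * 1.0
days_to_load_one_year = minutes_to_load_one_year / (60 * 24)   # ~6.1 days
days_to_load_six_years = 6 * days_to_load_one_year             # ~36.5 days
```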

Page 13: Using Apache Spark and MySQL for Data Analysis


Loading wikistat as is into Spark

• Just copy files to storage (AWS S3 / local / etc.)
– and create the SQL structure
• Or read the CSV, aggregate/filter in Spark, and
– load the aggregated data into MySQL

Page 14: Using Apache Spark and MySQL for Data Analysis


Loading wikistat as is into Spark

• How fast is the search?
– Depends on the number of nodes
• 1000-node Spark cluster:
– 4.5 TB, 104 billion records
– Exec time: 45 sec
– Scanning 4.5 TB of data

• http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf

Page 15: Using Apache Spark and MySQL for Data Analysis


Pipelines: MySQL vs Spark

Page 16: Using Apache Spark and MySQL for Data Analysis


Spark and WikiStats: load pipeline

Row(project=p[0], url=urllib.unquote(p[1]).lower(),
    num_requests=int(p[2]), content_size=int(p[3]))
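The fragment above is the per-line map step of the pipeline. A self-contained Python 3 version might look as follows (hedged: the deck's code is Python 2, where urllib.unquote lives at the top level; a plain dict stands in for pyspark.sql.Row here):

```python
from urllib.parse import unquote

def parse_line(line: str) -> dict:
    # One wikistats record: "project title num_requests content_size"
    p = line.split(" ")
    return {"project": p[0],
            "url": unquote(p[1]).lower(),   # decode %xx escapes, normalize case
            "num_requests": int(p[2]),
            "content_size": int(p[3])}
```

In Spark this function would be applied inside a map() over the text file, producing one Row per line.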

Page 17: Using Apache Spark and MySQL for Data Analysis


Save results to MySQL

group_res = sqlContext.sql("SELECT '" + mydate + "' as mydate, "
                           "url, count(*) as cnt, sum(num_requests) as tot_visits "
                           "FROM wikistats GROUP BY url")

# Save to MySQL
mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
group_res.write.jdbc(url=mysql_url, table="wikistats.wikistats_by_day_spark", mode="append")
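What the GROUP BY above computes can be mimicked in a few lines of plain Python (a conceptual sketch only, with invented sample records; Spark performs the same aggregation in parallel across partitions):

```python
from collections import defaultdict

def group_by_url(records):
    # Mirrors: SELECT url, count(*) AS cnt, SUM(num_requests) AS tot_visits
    #          FROM wikistats GROUP BY url
    agg = defaultdict(lambda: {"cnt": 0, "tot_visits": 0})
    for r in records:
        agg[r["url"]]["cnt"] += 1
        agg[r["url"]]["tot_visits"] += r["num_requests"]
    return dict(agg)

res = group_by_url([{"url": "a", "num_requests": 3},
                    {"url": "a", "num_requests": 4},
                    {"url": "b", "num_requests": 1}])
```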

Page 18: Using Apache Spark and MySQL for Data Analysis


Multi-Threaded Inserts

Page 19: Using Apache Spark and MySQL for Data Analysis


PySpark: CPU

Cpu0  : 94.4%us,  0.0%sy, 0.0%ni,   5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1  :  5.7%us,  0.0%sy, 0.0%ni,  92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu2  : 95.0%us,  0.0%sy, 0.0%ni,   5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3  : 94.9%us,  0.0%sy, 0.0%ni,   5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4  :  0.6%us,  0.0%sy, 0.0%ni,  99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5  : 94.3%us,  0.0%sy, 0.0%ni,   5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6  : 94.3%us,  0.0%sy, 0.0%ni,   5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7  : 95.0%us,  0.0%sy, 0.0%ni,   5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8  : 94.4%us,  0.0%sy, 0.0%ni,   5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
...
Cpu17 : 94.3%us,  0.0%sy, 0.0%ni,   5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 94.3%us,  0.0%sy, 0.0%ni,   5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 94.9%us,  0.0%sy, 0.0%ni,   5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 :  0.0%us,  0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 94.9%us,  0.0%sy, 0.0%ni,   5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 94.9%us,  0.0%sy, 0.0%ni,   5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 :  0.0%us,  0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers

Page 20: Using Apache Spark and MySQL for Data Analysis


Monitoring your jobs

Page 21: Using Apache Spark and MySQL for Data Analysis


Page 22: Using Apache Spark and MySQL for Data Analysis


Search the WikiStats in MySQL: 10 most frequently queried wiki pages in January 2008

mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
       FROM wikistats_by_day_spark
       where lower(url) not like '%special%'
         and lower(url) not like '%page%'
         and lower(url) not like '%test%'
         and lower(url) not like '%wiki%'
       group by lower(url)
       order by max_visits desc
       limit 10;
+--------------------------------------------------------+------------+----------+
| lurl                                                   | max_visits | count(*) |
+--------------------------------------------------------+------------+----------+
| heath_ledger                                           |    4247338 |      131 |
| cloverfield                                            |    3846404 |      131 |
| barack_obama                                           |    2238406 |      153 |
| 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
| the_dark_knight_(film)                                 |    1417186 |       64 |
| martin_luther_king,_jr.                                |    1394934 |      136 |
| deaths_in_2008                                         |    1372510 |       67 |
| united_states                                          |    1357253 |      167 |
| scientology                                            |    1349654 |      108 |
| portal:current_events                                  |    1261538 |      125 |
+--------------------------------------------------------+------------+----------+
10 rows in set (1 hour 22 min 10.02 sec)

Page 23: Using Apache Spark and MySQL for Data Analysis


Search the WikiStats in SparkSQL

spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
           USING org.apache.spark.sql.parquet
           OPTIONS (path "/ssd/wikistats_parquet_bydate");
Time taken: 3.466 seconds

spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
           FROM wikistats_parquet
           where lower(url) not like '%special%'
             and lower(url) not like '%page%'
             and lower(url) not like '%test%'
             and lower(url) not like '%wiki%'
           group by lower(url)
           order by max_visits desc
           limit 10;
heath_ledger                                            4247335  42
cloverfield                                             3846400  42
barack_obama                                            2238402  53
1925_in_baseball#negro_league_baseball_final_standings  1791341  11
the_dark_knight_(film)                                  1417183  36
martin_luther_king,_jr.                                 1394934  46
deaths_in_2008                                          1372510  38
united_states                                           1357251  55
scientology                                             1349650  44
portal:current_events                                   1261305  44
Time taken: 1239.014 seconds (≈ 20 min), Fetched 10 row(s)

10 most frequently queried wiki pages in January 2008

Page 24: Using Apache Spark and MySQL for Data Analysis


Apache Drill

Treat any data source as a table (even if it is not)

Querying MongoDB with SQL

Page 25: Using Apache Spark and MySQL for Data Analysis


Magic?


Page 26: Using Apache Spark and MySQL for Data Analysis


Recap…

MySQL:
1. Searches the full dataset (may be pre-filtered, but not aggregated)
2. No parallelism
3. Based on an index?
4. InnoDB <> columnar
5. Partitioning?

Spark:
1. Dataset is already
– filtered (only site = "en")
– aggregated (group by url)
2. Parallelism (+)
3. Not based on an index
4. Columnar (+)
5. Partitioning (+)

Page 27: Using Apache Spark and MySQL for Data Analysis


Thank you!

https://www.linkedin.com/in/alexanderrubin

Alexander Rubin