Using Apache Spark and MySQL for Data Analysis
Alexander Rubin, Sveta Smirnova, Percona, February 4, 2017
www.percona.com
Agenda
• Why Spark?
• Spark Examples
  – Wikistats analysis with Spark
What is Spark anyway?
[Diagram: an app speaks SQL over a protocol to reach the data; Spark nodes provide parallel compute only, reading from a local FS]
Why Spark?
• In-memory processing with caching
• Massively parallel
• Direct access to data sources (e.g. MySQL):
  >>> df = sqlContext.load(source="jdbc",
  ...     url="jdbc:mysql://localhost?user=root", dbtable="ontime.ontime_sm")
• Can store data in Hadoop HDFS / S3 / local filesystem
• Native Python and R integration
Spark vs MySQL

Spark vs. MySQL for Big Data:
  MySQL: Indexes, Partitioning, "Sharding"
  Spark: Full table scan, Partitioning, Map/Reduce
Spark (vs. MySQL)
• No indexes
• All processing is a full scan
• BUT: distributed and parallel
• No transactions
• High latency (usually)

MySQL: 1 query = 1 CPU core
Indexes (BTree): the Big Data challenge
• Creating an index for petabytes of data?
• Updating an index for petabytes of data?
• Reading a terabyte-sized index?
• Random reads across a petabyte?
Full scan in parallel is better for big data
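A toy sketch of that point (plain Python, not Spark; thread workers stand in for cluster nodes, and the dataset and field names are made up): with no index, the only way to answer a query is a full scan, but a full scan splits naturally into independent chunks.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "table" with no index: every query must scan all rows.
rows = [{"project": "en", "url": "page_%d" % i, "num_requests": i % 7}
        for i in range(10_000)]

def scan_chunk(chunk, predicate):
    """Full scan of one chunk: test every row against the predicate."""
    return [r for r in chunk if predicate(r)]

def parallel_full_scan(rows, predicate, workers=4):
    """Split the table into chunks and scan them concurrently,
    the way Spark fans a full scan out across partitions."""
    size = (len(rows) + workers - 1) // workers
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: scan_chunk(c, predicate), chunks)
    return [r for part in parts for r in part]

hot = parallel_full_scan(rows, lambda r: r["num_requests"] == 6)
```

Each chunk is scanned independently, so adding workers (or nodes) divides the scan time, whereas a single BTree gives no such easy split.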
ETL / Pipeline

MySQL (ETL):
1. Extract data from the external source
2. Transform before loading
3. Load data into MySQL

Spark (ELT):
1. Extract data from the external source
2. Load data (or rsync) to all Spark nodes
3. Transform / analyze / visualize the data
Parallelism
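The two pipelines above can be sketched on toy data (plain Python; the line format and field names are illustrative, modeled on wikistats): in ETL, rows are transformed before loading, so malformed input never reaches the store; in ELT, raw files are loaded first and the transform runs afterwards, where it can be parallelized.

```python
raw = ["en Main_Page 42 1000", "en Spark 7 2000", "bad_line"]

def transform(line):
    """Parse one space-separated record; reject malformed input."""
    parts = line.split(" ")
    if len(parts) != 4:
        return None
    project, title, requests, size = parts
    return {"project": project, "title": title, "num_requests": int(requests)}

# ETL (MySQL style): transform before load -- bad rows never get stored.
mysql_table = [row for row in map(transform, raw) if row is not None]

# ELT (Spark style): load the raw files first ("rsync to all spark nodes"),
# then transform/analyze on read.
spark_storage = list(raw)
analyzed = [row for row in map(transform, spark_storage) if row is not None]
```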
Schema on Read vs. Schema on Write

Schema on Write (MySQL):
• LOAD DATA INFILE will verify (validate) the input
• ... with implicit data conversion
• ... or fail if the number of columns is wrong

Schema on Read (Spark):
• No "load data" per se, so nothing to validate at load time
• ... create an external table or read the CSV
• ... validation happens on "read"/SELECT
Example: Loading wikistat into MySQL
1. Extract data from the external source and uncompress
2. Load data into MySQL and transform

Wikipedia page counts: download is >10 TB

load data local infile '$file'
  into table wikistats.wikistats_full
  CHARACTER SET latin1
  FIELDS TERMINATED BY ' '
  (project_name, title, num_requests, content_size)
  set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
      title_md5 = unhex(md5(title));
http://dumps.wikimedia.org/other/pagecounts-raw/
Load timing per hour of wikistat
• InnoDB: 52.34 sec
• MyISAM: 11.08 sec (+ index build)
• 1 hour of wikistats loads in ~1 minute
• 1 year will load in ~6 days (8765.81 hours in 1 year)
• 6 years => more than 1 month to load

Not even counting the insert-time degradation...
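A quick check of the arithmetic behind those estimates (one hour of data per minute of load time, 8765.81 hours in a year, as stated on this slide):

```python
# Each hour of wikistats data takes about one minute to load.
HOURS_PER_YEAR = 8765.81
minutes_per_hour_of_data = 1.0

# One year of data = 8765.81 load-minutes, converted to days.
days_for_one_year = HOURS_PER_YEAR * minutes_per_hour_of_data / 60 / 24
days_for_six_years = 6 * days_for_one_year   # over a month of loading
```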
Loading wikistat as is into Spark
• Just copy the files to storage (AWS S3 / local / etc.)
  – and create the SQL structure
• Or read the CSV, aggregate/filter in Spark, and
  – load the aggregated data into MySQL
Loading wikistat as is into Spark
• How fast is a search?
  – Depends on the number of nodes
• 1000-node Spark cluster:
  – 4.5 TB, 104 billion records
  – Execution time: 45 sec, scanning 4.5 TB of data
• http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
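The throughput implied by those Spark Summit numbers is easy to work out (a rough estimate; unit conversions use 1024, and the per-node figure assumes the scan was evenly spread):

```python
# Figures from the slide: 1000 nodes scan 4.5 TB in 45 seconds.
nodes = 1000
tb_scanned = 4.5
seconds = 45

cluster_gb_per_sec = tb_scanned * 1024 / seconds       # whole cluster
per_node_mb_per_sec = cluster_gb_per_sec * 1024 / nodes  # per node
```

Roughly 100 GB/s for the cluster, i.e. on the order of 100 MB/s per node, which is about what a single disk can stream sequentially.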
Pipelines: MySQL vs Spark
Spark and WikiStats: load pipeline

.map(lambda p: Row(project=p[0], url=urllib.unquote(p[1]).lower(),
                   num_requests=int(p[2]), content_size=int(p[3])))
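The same parse step can be sketched in plain Python (the slide's code is Python 2 PySpark, where unquote lives in urllib; here it comes from urllib.parse, a dict stands in for Row, and the helper name is made up):

```python
from urllib.parse import unquote

def parse_wikistats_line(line):
    """Parse one space-separated wikistats record, like the .map step:
    project, percent-decoded lowercase url, request count, content size."""
    p = line.split(" ")
    if len(p) != 4:
        return None  # mirrors a .filter(len == 4) step
    return {"project": p[0],
            "url": unquote(p[1]).lower(),
            "num_requests": int(p[2]),
            "content_size": int(p[3])}

row = parse_wikistats_line("en The_Dark_Knight_%28film%29 1417186 4096")
```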
Save results to MySQL
group_res = sqlContext.sql("SELECT '" + mydate + "' as mydate, "
    "url, count(*) as cnt, sum(num_requests) as tot_visits "
    "FROM wikistats GROUP BY url")

# Save to MySQL
mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
group_res.write.jdbc(url=mysql_url, table="wikistats.wikistats_by_day_spark",
                     mode="append")
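The GROUP BY above has a simple plain-Python equivalent, which makes the aggregation explicit (toy rows; field and result names follow the slides):

```python
from collections import defaultdict

# Toy input rows, shaped like the parsed wikistats records.
rows = [
    {"url": "heath_ledger", "num_requests": 100},
    {"url": "heath_ledger", "num_requests": 50},
    {"url": "cloverfield", "num_requests": 70},
]

# Per url: cnt = count(*), tot_visits = sum(num_requests).
agg = defaultdict(lambda: {"cnt": 0, "tot_visits": 0})
for r in rows:
    agg[r["url"]]["cnt"] += 1
    agg[r["url"]]["tot_visits"] += r["num_requests"]
```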
Multi-Threaded Inserts
PySpark: CPU
Cpu0  : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1  :  5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu2  : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3  : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4  :  0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7  : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8  : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
...
Cpu17 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 :  0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 :  0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
Monitoring your jobs
Search the WikiStats in MySQL
10 most frequently queried wiki pages in January 2008

mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
       FROM wikistats_by_day_spark
       where lower(url) not like '%special%' and lower(url) not like '%page%'
       and lower(url) not like '%test%' and lower(url) not like '%wiki%'
       group by lower(url) order by max_visits desc limit 10;
+--------------------------------------------------------+------------+----------+
| lurl                                                   | max_visits | count(*) |
+--------------------------------------------------------+------------+----------+
| heath_ledger                                           |    4247338 |      131 |
| cloverfield                                            |    3846404 |      131 |
| barack_obama                                           |    2238406 |      153 |
| 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
| the_dark_knight_(film)                                 |    1417186 |       64 |
| martin_luther_king,_jr.                                |    1394934 |      136 |
| deaths_in_2008                                         |    1372510 |       67 |
| united_states                                          |    1357253 |      167 |
| scientology                                            |    1349654 |      108 |
| portal:current_events                                  |    1261538 |      125 |
+--------------------------------------------------------+------------+----------+
10 rows in set (1 hour 22 min 10.02 sec)
Search the WikiStats in SparkSQL
10 most frequently queried wiki pages in January 2008

spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
           USING org.apache.spark.sql.parquet
           OPTIONS (path "/ssd/wikistats_parquet_bydate");
Time taken: 3.466 seconds

spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
           FROM wikistats_parquet
           where lower(url) not like '%special%' and lower(url) not like '%page%'
           and lower(url) not like '%test%' and lower(url) not like '%wiki%'
           group by lower(url) order by max_visits desc limit 10;
heath_ledger                                            4247335  42
cloverfield                                             3846400  42
barack_obama                                            2238402  53
1925_in_baseball#negro_league_baseball_final_standings  1791341  11
the_dark_knight_(film)                                  1417183  36
martin_luther_king,_jr.                                 1394934  46
deaths_in_2008                                          1372510  38
united_states                                           1357251  55
scientology                                             1349650  44
portal:current_events                                   1261305  44
Time taken: 1239.014 seconds (about 20 min), Fetched 10 row(s)
Apache Drill
Treat any data source as a table (even if it is not)
Querying MongoDB with SQL
Magic?
Recap...

MySQL:
1. Searches the full dataset
   • may be pre-filtered
   • not aggregated
2. No parallelism
3. Based on an index?
4. InnoDB <> columnar
5. Partitioning?

Spark:
1. Dataset is already
   – filtered (only site = "en")
   – aggregated (group by url)
2. Parallelism (+)
3. Not based on an index
4. Columnar (+)
5. Partitioning (+)
Thank you!
https://www.linkedin.com/in/alexanderrubin
Alexander Rubin