Using Apache Spark and MySQL for Data Analysis
Alexander Rubin, Sveta Smirnova, Percona, February 4, 2017
www.percona.com
Agenda
• Why Spark?
• Spark Examples
  – Wikistats analysis with Spark
What is Spark anyway?
[Diagram: an app speaks SQL over a protocol to reach the data; Spark nodes provide parallel compute only, reading from a local FS]
Why Spark?
• In-memory processing with caching
• Massively parallel
• Direct access to data sources (e.g. MySQL):
  >>> df = sqlContext.load(source="jdbc",
  ...     url="jdbc:mysql://localhost?user=root", dbtable="ontime.ontime_sm")
• Can store data in Hadoop HDFS / S3 / local filesystem
• Native Python and R integration
Spark vs MySQL

Spark vs. MySQL for Big Data:
  MySQL: Indexes, Partitioning, "Sharding"
  Spark: Full table scan, Partitioning, Map/Reduce
Spark (vs. MySQL)
• No indexes
• All processing is a full scan
• BUT: distributed and parallel
• No transactions
• High latency (usually)

MySQL: 1 query = 1 CPU core
Indexes (BTree): the Big Data challenge
• Creating an index for petabytes of data?
• Updating an index for petabytes of data?
• Reading a terabyte-sized index?
• Random reads across a petabyte?
Full scan in parallel is better for big data
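A toy sketch of that point (plain Python, not Spark; thread workers stand in for cluster nodes, and the dataset and field names are made up): with no index, the only way to answer a query is a full scan, but a full scan splits naturally into independent chunks.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "table" with no index: every query must scan all rows.
rows = [{"project": "en", "url": "page_%d" % i, "num_requests": i % 7}
        for i in range(10_000)]

def scan_chunk(chunk, predicate):
    """Full scan of one chunk: test every row against the predicate."""
    return [r for r in chunk if predicate(r)]

def parallel_full_scan(rows, predicate, workers=4):
    """Split the table into chunks and scan them concurrently,
    the way Spark fans a full scan out across partitions."""
    size = (len(rows) + workers - 1) // workers
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: scan_chunk(c, predicate), chunks)
    return [r for part in parts for r in part]

hot = parallel_full_scan(rows, lambda r: r["num_requests"] == 6)
```

Each chunk is scanned independently, so adding workers (or nodes) divides the scan time, whereas a single BTree gives no such easy split.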
ETL / Pipeline

MySQL (ETL):
1. Extract data from the external source
2. Transform before loading
3. Load data into MySQL

Spark (ELT):
1. Extract data from the external source
2. Load data (or rsync) to all Spark nodes
3. Transform / analyze / visualize the data
Parallelism
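The two pipelines above can be sketched on toy data (plain Python; the line format and field names are illustrative, modeled on wikistats): in ETL, rows are transformed before loading, so malformed input never reaches the store; in ELT, raw files are loaded first and the transform runs afterwards, where it can be parallelized.

```python
raw = ["en Main_Page 42 1000", "en Spark 7 2000", "bad_line"]

def transform(line):
    """Parse one space-separated record; reject malformed input."""
    parts = line.split(" ")
    if len(parts) != 4:
        return None
    project, title, requests, size = parts
    return {"project": project, "title": title, "num_requests": int(requests)}

# ETL (MySQL style): transform before load -- bad rows never get stored.
mysql_table = [row for row in map(transform, raw) if row is not None]

# ELT (Spark style): load the raw files first ("rsync to all spark nodes"),
# then transform/analyze on read.
spark_storage = list(raw)
analyzed = [row for row in map(transform, spark_storage) if row is not None]
```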
Schema on Read vs. Schema on Write

Schema on Write (MySQL):
• LOAD DATA INFILE will verify (validate) the input
• ... with implicit data conversion
• ... or fail if the number of columns is wrong

Schema on Read (Spark):
• No "load data" per se, so nothing to validate at load time
• ... create an external table or read the CSV
• ... validation happens on "read"/SELECT
Example: Loading wikistat into MySQL
1. Extract data from the external source and uncompress
2. Load data into MySQL and transform

Wikipedia page counts: download is >10 TB

load data local infile '$file'
  into table wikistats.wikistats_full
  CHARACTER SET latin1
  FIELDS TERMINATED BY ' '
  (project_name, title, num_requests, content_size)
  set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
      title_md5 = unhex(md5(title));
http://dumps.wikimedia.org/other/pagecounts-raw/
Load timing per hour of wikistat
• InnoDB: 52.34 sec
• MyISAM: 11.08 sec (+ index build)
• 1 hour of wikistats loads in ~1 minute
• 1 year will load in ~6 days (8765.81 hours in 1 year)
• 6 years => more than 1 month to load

Not even counting the insert-time degradation...
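A quick check of the arithmetic behind those estimates (one hour of data per minute of load time, 8765.81 hours in a year, as stated on this slide):

```python
# Each hour of wikistats data takes about one minute to load.
HOURS_PER_YEAR = 8765.81
minutes_per_hour_of_data = 1.0

# One year of data = 8765.81 load-minutes, converted to days.
days_for_one_year = HOURS_PER_YEAR * minutes_per_hour_of_data / 60 / 24
days_for_six_years = 6 * days_for_one_year   # over a month of loading
```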
Loading wikistat as is into Spark
• Just copy the files to storage (AWS S3 / local / etc.)
  – and create the SQL structure
• Or read the CSV, aggregate/filter in Spark, and
  – load the aggregated data into MySQL
Loading wikistat as is into Spark
• How fast is a search?
  – Depends on the number of nodes
• 1000-node Spark cluster:
  – 4.5 TB, 104 billion records
  – Execution time: 45 sec, scanning 4.5 TB of data
• http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
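The throughput implied by those Spark Summit numbers is easy to work out (a rough estimate; unit conversions use 1024, and the per-node figure assumes the scan was evenly spread):

```python
# Figures from the slide: 1000 nodes scan 4.5 TB in 45 seconds.
nodes = 1000
tb_scanned = 4.5
seconds = 45

cluster_gb_per_sec = tb_scanned * 1024 / seconds       # whole cluster
per_node_mb_per_sec = cluster_gb_per_sec * 1024 / nodes  # per node
```

Roughly 100 GB/s for the cluster, i.e. on the order of 100 MB/s per node, which is about what a single disk can stream sequentially.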
Pipelines: MySQL vs Spark
Spark and WikiStats: load pipeline

.map(lambda p: Row(project=p[0], url=urllib.unquote(p[1]).lower(),
                   num_requests=int(p[2]), content_size=int(p[3])))
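The same parse step can be sketched in plain Python (the slide's code is Python 2 PySpark, where unquote lives in urllib; here it comes from urllib.parse, a dict stands in for Row, and the helper name is made up):

```python
from urllib.parse import unquote

def parse_wikistats_line(line):
    """Parse one space-separated wikistats record, like the .map step:
    project, percent-decoded lowercase url, request count, content size."""
    p = line.split(" ")
    if len(p) != 4:
        return None  # mirrors a .filter(len == 4) step
    return {"project": p[0],
            "url": unquote(p[1]).lower(),
            "num_requests": int(p[2]),
            "content_size": int(p[3])}

row = parse_wikistats_line("en The_Dark_Knight_%28film%29 1417186 4096")
```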
Save results to MySQL
group_res = sqlContext.sql("SELECT '" + mydate + "' as mydate, "
    "url, count(*) as cnt, sum(num_requests) as tot_visits "
    "FROM wikistats GROUP BY url")

# Save to MySQL
mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
group_res.write.jdbc(url=mysql_url, table="wikistats.wikistats_by_day_spark",
                     mode="append")
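The GROUP BY above has a simple plain-Python equivalent, which makes the aggregation explicit (toy rows; field and result names follow the slides):

```python
from collections import defaultdict

# Toy input rows, shaped like the parsed wikistats records.
rows = [
    {"url": "heath_ledger", "num_requests": 100},
    {"url": "heath_ledger", "num_requests": 50},
    {"url": "cloverfield", "num_requests": 70},
]

# Per url: cnt = count(*), tot_visits = sum(num_requests).
agg = defaultdict(lambda: {"cnt": 0, "tot_visits": 0})
for r in rows:
    agg[r["url"]]["cnt"] += 1
    agg[r["url"]]["tot_visits"] += r["num_requests"]
```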
Multi-Threaded Inserts
PySpark: CPU
Cpu0  : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1  :  5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu2  : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3  : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4  :  0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7  : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8  : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
...
Cpu17 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 :  0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 :  0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
Monitoring your jobs
Search the WikiStats in MySQL
10 most frequently queried wiki pages in January 2008

mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
       FROM wikistats_by_day_spark
       where lower(url) not like '%special%' and lower(url) not like '%page%'
       and lower(url) not like '%test%' and lower(url) not like '%wiki%'
       group by lower(url) order by max_visits desc limit 10;
+--------------------------------------------------------+------------+----------+
| lurl                                                   | max_visits | count(*) |
+--------------------------------------------------------+------------+----------+
| heath_ledger                                           |    4247338 |      131 |
| cloverfield                                            |    3846404 |      131 |
| barack_obama                                           |    2238406 |      153 |
| 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
| the_dark_knight_(film)                                 |    1417186 |       64 |
| martin_luther_king,_jr.                                |    1394934 |      136 |
| deaths_in_2008                                         |    1372510 |       67 |
| united_states                                          |    1357253 |      167 |
| scientology                                            |    1349654 |      108 |
| portal:current_events                                  |    1261538 |      125 |
+--------------------------------------------------------+------------+----------+
10 rows in set (1 hour 22 min 10.02 sec)
Search the WikiStats in SparkSQL
10 most frequently queried wiki pages in January 2008

spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
           USING org.apache.spark.sql.parquet
           OPTIONS (path "/ssd/wikistats_parquet_bydate");
Time taken: 3.466 seconds

spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
           FROM wikistats_parquet
           where lower(url) not like '%special%' and lower(url) not like '%page%'
           and lower(url) not like '%test%' and lower(url) not like '%wiki%'
           group by lower(url) order by max_visits desc limit 10;
heath_ledger                                            4247335  42
cloverfield                                             3846400  42
barack_obama                                            2238402  53
1925_in_baseball#negro_league_baseball_final_standings  1791341  11
the_dark_knight_(film)                                  1417183  36
martin_luther_king,_jr.                                 1394934  46
deaths_in_2008                                          1372510  38
united_states                                           1357251  55
scientology                                             1349650  44
portal:current_events                                   1261305  44
Time taken: 1239.014 seconds (about 20 min), Fetched 10 row(s)
Apache Drill
Treat any data source as a table (even if it is not)
Querying MongoDB with SQL
Magic?
Recap...

MySQL:
1. Searches the full dataset
   • may be pre-filtered
   • not aggregated
2. No parallelism
3. Based on an index?
4. InnoDB <> columnar
5. Partitioning?

Spark:
1. Dataset is already
   – filtered (only site = "en")
   – aggregated (group by url)
2. Parallelism (+)
3. Not based on an index
4. Columnar (+)
5. Partitioning (+)
Thank you!
https://www.linkedin.com/in/alexanderrubin
Alexander Rubin