One Tool to Rule Them All: Seamless SQL on MongoDB, MySQL and Redis with Apache Spark

Tim Vaillancourt, Sr. Technical Operations Architect

Transcript of One Tool to Rule Them All: Seamless SQL on MongoDB, MySQL and Redis with Apache Spark

Page 1: One Tool to Rule Them All: Seamless SQL on MongoDB, MySQL and Redis with Apache Spark

Tim Vaillancourt, Sr. Technical Operations Architect

One Tool to Rule Them All: Seamless SQL on MongoDB, MySQL and Redis with Apache Spark

Page 2

Starring…

MySQL as “the RDBMS”

MongoDB as “the document-store”

Redis as “the in-memory K/V store”

Page 3

About Me
• Joined Percona in January 2016
• Sr. Technical Operations Architect for MongoDB
• Previously:
  • EA DICE (MySQL DBA)
  • EA SPORTS (Sys/NoSQL DBA Ops)
  • Amazon/AbeBooks Inc (Sys/MySQL+NoSQL DBA Ops)
• Main techs: MySQL, MongoDB, Cassandra, Solr, Redis, queues, etc.
• 10+ years tuning Linux for database workloads (off and on)
• NOT an Apache Spark expert

Page 4

Apache Spark
• “…is a fast and general engine for large-scale data processing”
• Written in Scala, utilises the Akka framework and runs on the JVM
• Supports jobs written in Java, Scala, Python and R
• Pluggable datasources for various file types, databases, SaaS platforms, etc.
• Fast and efficient: jobs work on datasources quickly, in parallel
• Optional clustering
  • Master/Slave Spark cluster, ZooKeeper for Master elections
  • Slave workers connect to the Master
  • The Master distributes tasks evenly to available workers
• Streaming and Machine Learning (MLlib) capabilities
• Programmatic and SQL(!) querying capabilities

Page 5

Apache Spark: Software Architecture
• Jobs go to the Cluster Master, or run directly in the client JVM
• The Cluster Master directs jobs to nodes with available resources via messages
• Cluster Master HA
  • Slaves reconnect to the Master
  • Apache ZooKeeper for true HA
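The ZooKeeper-backed Master HA mentioned here is enabled through standalone-cluster configuration; a hypothetical `spark-env.sh` fragment for each Master node, based on Spark's standalone ZooKeeper recovery mode (the ensemble addresses are invented):

```shell
# Hypothetical spark-env.sh fragment on each Master node; the ZooKeeper
# ensemble addresses below are invented for illustration.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```

With this in place, standby Masters register with ZooKeeper and one is elected leader if the active Master fails.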

Page 6

Apache Spark: Hadoop Comparison
• Hadoop MapReduce
  • Batch-based
  • Uses data less efficiently
  • Relatively hard to develop/maintain
• Spark
  • Stream processing
  • Fast and parallel
  • Keeps data in memory as much as possible within jobs
  • Divides work into many lightweight sub-tasks in threads
• Datasources
  • Uses datasource awareness to scale (e.g. indices, shard awareness, etc.)
  • Spark allows processing and storage to scale separately

Page 7

Apache Spark: RDDs and DataFrames
• RDD: Resilient Distributed Dataset
  • The original API for accessing data in Spark
  • Lazy: does not access data until a real action is performed
  • Spark’s optimiser cannot see inside an RDD
  • RDDs are slow from Python
• DataFrames API
  • Higher-level API, focused on “what” is being done rather than “how”
  • Has schemas / is table-like
  • Interchangeable programmatic and SQL APIs
  • Much easier to read and comprehend
  • Optimises the execution plan
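To make the contrast concrete, here is the same per-station average computed in the imperative, RDD-like style and described in the declarative, DataFrame/SQL style, sketched in plain Python (no Spark required; station names and readings are invented):

```python
# Plain-Python sketch of the two API styles -- the point is only the
# *shape* of the code. Station names and temperatures are invented.
samples = [("buoy-a", 10.0), ("buoy-a", 14.0), ("buoy-b", 7.0)]

# RDD style: imperative "how" -- explicit per-record steps that an
# optimiser cannot see inside.
totals = {}
for station, temp in samples:
    running_sum, count = totals.get(station, (0.0, 0))
    totals[station] = (running_sum + temp, count + 1)
avg_by_station = {s: t / c for s, (t, c) in totals.items()}
print(avg_by_station)  # {'buoy-a': 12.0, 'buoy-b': 7.0}

# DataFrame/SQL style: declarative "what" -- roughly equivalent to
#   SELECT station, AVG(temp) FROM samples GROUP BY station
# which Spark's optimiser is free to plan and reorder as it sees fit.
```

The declarative form is what lets the DataFrame optimiser rewrite the execution plan; the imperative form pins it down step by step.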

Page 8

Apache Spark: Datasources
• Provides a pluggable mechanism for accessing structured data through Spark SQL
• At least these databases are supported in some way:
  • MySQL
  • MongoDB
  • Redis
  • Cassandra
  • Postgres
  • HBase
  • HDFS
  • Files
  • S3
• In practice: search GitHub, find the .jar file, deploy it!
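In practice, “deploy it” usually means handing the connector jar(s) to `spark-submit`; hypothetical invocations (jar paths and package coordinates are illustrative — each connector's README has the real ones):

```shell
# Hypothetical invocations -- versions and paths are illustrative.
# Ship locally downloaded connector jars with the job:
spark-submit --master spark://master:7077 \
    --jars /opt/jars/mysql-connector-java.jar,/opt/jars/mongo-spark-connector.jar \
    weather_job.py

# Or let Spark resolve a published connector from Maven coordinates:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.10:1.1.0 weather_job.py
```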

Page 9

Apache Spark: SQLContext
• ANSI SQL
  • A 30+ year-old language
  • Easy to understand
  • Nearly everyone already knows it
• Spark SQLContext
  • A Spark module for structured data processing, wrapping the RDD API
  • Uses the same execution engine as the programmatic APIs
  • Supports:
    • JOINs/UNIONs
    • EXPLAIN
    • Subqueries
    • ORDER/GROUP/SORT BY
    • Most datatypes you’d expect
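Since these are ordinary ANSI SQL constructs, the flavour of query involved can be shown against any SQL engine; here a JOIN with GROUP BY/ORDER BY runs on Python's built-in SQLite purely for illustration (table and column names are invented):

```python
import sqlite3

# The style of statement a SQLContext accepts: a JOIN plus aggregates
# with GROUP BY / ORDER BY / LIMIT. An in-memory SQLite database stands
# in for Spark SQL here only to show the SQL itself is plain ANSI.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE samples (station_id INTEGER, water_temp REAL);
    INSERT INTO stations VALUES (1, 'buoy-a'), (2, 'buoy-b');
    INSERT INTO samples VALUES (1, 10.0), (1, 14.0), (2, 7.0);
""")
rows = db.execute("""
    SELECT s.name,
           MIN(m.water_temp), AVG(m.water_temp), MAX(m.water_temp),
           COUNT(*)
    FROM stations s
    JOIN samples m ON m.station_id = s.id
    GROUP BY s.name
    ORDER BY AVG(m.water_temp) DESC
    LIMIT 10
""").fetchall()
print(rows)  # [('buoy-a', 10.0, 12.0, 14.0, 2), ('buoy-b', 7.0, 7.0, 7.0, 1)]
```

On Spark, the same text would go through `sqlContext.sql(...)` against registered temp tables instead of SQLite tables.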

Page 10

Apache Spark: Use Cases
• Business Intelligence/Analytics
  • Understand …
  • Tip: use dedicated replicas for expensive queries!
• Data Summaries and Batch Jobs
  • Perform expensive summaries in the background and save the result
  • Tip: use burstable/cloud hardware for infrequent batch jobs
• Real-time Stream Processing
  • Process data as it enters your system

Page 11

So why not Apache Drill?
• A schema-free SQL engine for Hadoop, NoSQL and cloud storage
• Drill does not:
  • Support relational databases (MySQL) or Redis
  • Offer programmatic-level querying
  • Offer streaming/continuous-query functionality
• I don’t know much about it

Page 12

The Demo
• Scenario: you run a weather-station data app that stores data in both an RDBMS and a document store
• Goal: summarise the weather-station data stored in the RDBMS and the document store
  • Min water temperature
  • Avg water temperature
  • Max water temperature
  • Total sample count
  • Get the Top 10 (based on avg water temperature)

Page 13

The Demo
• RDBMS: Percona Server for MySQL 5.7
  • Stores the weather-station metadata (roughly 350 stations: ID, name, location, etc.)
• Document store: Percona Server for MongoDB 3.2
  • Stores the weather time-series sample data (roughly 80,000 samples: various weather readings from stations)
• In-memory K/V store: Redis 2.8
  • Stores the summarised Top-10 data for fast querying of min, avg, max temperature and total sample counts

Page 14

The Demo
• Apache Spark 1.6.2 cluster
  • 1 x Master
  • 2 x Worker/Slaves
• 1 x PySpark job
• 1 x MacBook Pro
• 3 x VirtualBox VMs
• Job submitted on the Master

Page 15

The Demo

(Play Demo Video Now)

Page 16

The Demo: The PySpark Job

(screenshot of the job’s code: SparkContext and SQLContext setup; a MySQL table registered as an SQLContext temp table)
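The code screenshot is not reproduced in this transcript. A hypothetical sketch of the setup the annotations describe, using the Spark 1.6-era JDBC datasource (host, credentials, and table name are invented):

```python
# Connection options for Spark's JDBC datasource. All values are
# invented for illustration; the MySQL Connector/J jar must be on the
# job's classpath for the load to work.
mysql_options = {
    "url": "jdbc:mysql://mysql-host:3306/weather",
    "driver": "com.mysql.jdbc.Driver",
    "dbtable": "weather_stations",
    "user": "spark",
    "password": "secret",
}

# Against a live cluster, the setup the slide annotates would then be
# roughly:
#
#   from pyspark import SparkContext
#   from pyspark.sql import SQLContext
#
#   sc = SparkContext(appName="weather-demo")
#   sqlContext = SQLContext(sc)
#   stations = sqlContext.read.format("jdbc").options(**mysql_options).load()
#   stations.registerTempTable("stations")   # now queryable via SQL
```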

Page 17

The Demo: The PySpark Job

(screenshot of the job’s code: a MongoDB collection registered as an SQLContext temp table; a new Redis hash schema registered as an SQLContext temp table)
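Again, the screenshot itself is not reproduced; a hypothetical sketch of the option plumbing involved (option names vary by connector, so treat these purely as placeholders and check the deployed connector’s README):

```python
# Datasource options -- all values invented; the exact option names
# depend on which MongoDB/Redis connector jars are deployed.
mongo_options = {
    "host": "mongo-host:27017",
    "database": "weather",
    "collection": "samples",
}
redis_options = {
    "host": "redis-host",
    "port": "6379",
    "table": "top10",   # key prefix for the summary hashes
}

# Against a live cluster, roughly:
#
#   samples = sqlContext.read.format("<mongodb connector datasource>") \
#                       .options(**mongo_options).load()
#   samples.registerTempTable("samples")
#
#   # ...plus an empty, schema'd temp table backed by Redis hashes to
#   # receive the Top-10 summary output.
```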

Page 18

The Demo: The PySpark Job

(screenshot of the job’s code: reading from Redis; the aggregation)
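The aggregation itself is not visible in the transcript; a hypothetical reconstruction of the kind of statement the job would pass to `sqlContext.sql()`, with table and column names invented to match the demo’s stated goals:

```python
# Hypothetical reconstruction of the summary query -- names invented to
# match the demo's goals (min/avg/max water temp, sample count, Top 10).
top10_query = """
    SELECT s.name,
           MIN(m.water_temp)  AS min_temp,
           AVG(m.water_temp)  AS avg_temp,
           MAX(m.water_temp)  AS max_temp,
           COUNT(*)           AS sample_count
    FROM stations s
    JOIN samples m ON m.station_id = s.id
    GROUP BY s.name
    ORDER BY avg_temp DESC
    LIMIT 10
"""
# top10 = sqlContext.sql(top10_query)   # then written out to Redis
```

Note that `stations` lives in MySQL and `samples` in MongoDB; Spark SQL joins across the two temp tables as if they were one database.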

Page 19

The Demo: The PySpark Job

(screenshot of the job’s code: the aggregation, continued)

Page 20

The Demo

Page 21

The Demo

Page 22

The Demo

Page 23

The Demo

Page 24

Questions?

Page 25

DATABASE PERFORMANCE MATTERS