Real-Time Analytics with MemSQL and Spark

Neil Dahlke, Engineer

2016 November 4

About Me: Neil Dahlke, Engineer

MemSQL • real-time database for transactions / analytics

Formerly Globus • high performance data transfer for research scientists

Past talks • Real-time, Geospatial, Maps

Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-maps-by-neil-dahlke

WHAT WE ARE SEEING

A WORLD OF CONNECTED MACHINES AND PEOPLE

WHAT WE ARE SEEING: Sensors. Applications. Machines. And us. Generating more data every single day.

By 2020, over 20 billion connected things will be in use across a range of industries.

REAL-TIME INPUTS, LIVE OUTPUTS

• Sensors, Events, Streaming
• Inserts, Upserts, Queries
• Dashboards, Business Intelligence, Applications, Predictive Analytics

WHAT DO REAL TIME BUSINESSES NEED?

FAST DATA INGEST
The volume of data that can be ingested into the database

LOW LATENCY QUERIES
The time it takes to execute queries and receive results

HIGH CONCURRENCY
The ability to scale simultaneous operations

A massively scalable database and ingest solution allowed for massive growth, real-time analytic applications and faster, targeted.

Kafka
• Component we kept

S3
• Persisted all logs to cold storage for eventual analysis

Hadoop
• Nightly map-reduce jobs

Redshift
• Took a full day to load the previous day's data
• Reaching overlap of times caused data crisis

Before

Why was this bad for their business operations?
• No real-time access to analytics
• No SQL interface for analysts and data scientists
• Massive nightly Hadoop batch jobs (late data)
• Unfiltered and incomplete data (silos)
• Expensive

Why was this bad for their data operations?
• Too slow
• Not scalable
• No deduplication (aka not exactly-once)
• Low concurrency
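The "no deduplication, not exactly-once" problem can be illustrated with a small sketch. The slides do not show how the pipeline deduplicates, so everything here is an assumption: the `repins` table, its columns, and the `event_id` key are hypothetical, and Python's built-in sqlite3 stands in for the real database so the example runs anywhere. The idea is only that a unique key on each event makes a replayed batch harmless.

```python
import sqlite3


def row_count(conn):
    """Return the number of rows currently stored in the repins table."""
    return conn.execute("SELECT COUNT(*) FROM repins").fetchone()[0]


def ingest(conn, events):
    """Insert events, skipping any whose event_id is already stored.

    Returns how many new rows were actually added.
    """
    before = row_count(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO repins (event_id, user_id, pin_id) VALUES (?, ?, ?)",
        events,
    )
    conn.commit()
    return row_count(conn) - before


# Hypothetical schema: the primary key on event_id is what rejects duplicates.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repins (event_id TEXT PRIMARY KEY, user_id INTEGER, pin_id INTEGER)"
)

batch = [("e1", 10, 100), ("e2", 11, 100), ("e3", 12, 101)]
first = ingest(conn, batch)     # the batch arrives once
replayed = ingest(conn, batch)  # the same batch is delivered again after a retry
total = row_count(conn)
print(first, replayed, total)
```

Without the key, the replayed batch would double every count downstream. With it, delivery can be at-least-once while the stored result looks exactly-once.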

FAST DATA INGEST

LOW LATENCY QUERIES

HIGH CONCURRENCY

How It Works Now

TECHNICAL BENEFITS
• Instant accuracy to the latest re-pin
• 1 GB/sec totaling 72 TB/day

THE PINTEREST REAL-TIME ARCHITECTURE

REAL-TIME ANALYTICS

Accelerated ingest time by 200,000x

1 GB/sec totaling 72 TB/day

RESULTS
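The slides quote "1 GB/sec" and "72 TB/day" together without saying whether the per-second figure is a peak or an average. A quick conversion, assuming decimal units (1 TB = 1000 GB), shows how the two relate. The function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
GB_PER_TB = 1000                # decimal units; an assumption, not stated in the slides


def tb_per_day(gb_per_sec):
    """Daily volume in TB if the given GB/sec rate were sustained for 24 hours."""
    return gb_per_sec * SECONDS_PER_DAY / GB_PER_TB


def avg_gb_per_sec(tb_per_day_value):
    """Average GB/sec rate implied by a daily volume in TB."""
    return tb_per_day_value * GB_PER_TB / SECONDS_PER_DAY


sustained = tb_per_day(1.0)     # what 1 GB/sec around the clock would total
average = avg_gb_per_sec(72.0)  # the average rate that yields 72 TB/day

print(f"1 GB/sec sustained for 24 h = {sustained:.1f} TB/day")
print(f"72 TB/day averages {average:.2f} GB/sec")
```

Under these assumptions, a constant 1 GB/sec would total 86.4 TB/day, while 72 TB/day corresponds to an average of roughly 0.83 GB/sec. One reading is that 1 GB/sec is a rounded or peak figure, but the slides do not confirm this.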

Visualizing The Data

Demo built using
• Mapbox
• Websockets
• Tornado web server

When an image is re-pinned, the circles on the globe expand, showing higher volume areas

Reads data from MemSQL directly
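The demo code itself is not shown in the slides. As a rough sketch of the data-shaping step, the snippet below buckets re-pin locations into grid cells and emits GeoJSON, a format Mapbox can render as circles sized by a property. The `(lon, lat)` tuples are a hypothetical stand-in for rows read from MemSQL, the `to_geojson` helper and the `repins` property name are made up for illustration, and the websocket send is left out.

```python
import json
from collections import Counter


def to_geojson(repins, cell_deg=1.0):
    """Bucket (lon, lat) re-pin events into grid cells, one point per cell.

    Each feature carries a 'repins' count that a map layer could use
    to scale the circle drawn at that location.
    """
    counts = Counter(
        (round(lon / cell_deg) * cell_deg, round(lat / cell_deg) * cell_deg)
        for lon, lat in repins
    )
    features = [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"repins": n},
        }
        for (lon, lat), n in counts.items()
    ]
    return {"type": "FeatureCollection", "features": features}


# Hypothetical rows: two re-pins near San Francisco, one near New York.
events = [(-122.4, 37.8), (-122.3, 37.7), (-73.9, 40.7)]
collection = to_geojson(events)

# The JSON string is what a server would push to connected browsers.
message = json.dumps(collection)
print(message)
```

Aggregating on the server keeps each pushed message small: the browser receives one point per cell rather than one per re-pin, and a higher count produces a larger circle.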

Questions?

More Info
• http://www.odbms.org/blog/2015/04/powering-big-data-at-pinterest-interview-with-krishna-gade/
• https://gigaom.com/2015/02/18/pinterest-is-experimenting-with-memsql-for-real-time-data-analytics/
• https://www.infoq.com/news/2015/03/pinterest-memsql-spark-streaming
• http://blog.memsql.com/pinterest-apache-spark-use-case/
• https://engineering.pinterest.com/blog/real-time-analytics-pinterest

Resources
• https://github.com/memsql/memsql-spark-connector
• http://docs.memsql.com/docs/streamliner-administration
• http://docs.memsql.com/docs/pipelines-overview
• https://github.com/memsql/memsql-docker-quickstart

Thank You
