Integrating Elastic and Apache Spark - Elastic London Meetup (2015-09-24)

© Exabre Limited 2015. The three bar ‘F’ device, “The Filter” and “Filter Systems” are trademarks or registered trademarks of Exabre Limited. All rights reserved.

Neil Andrassy – CTO at The Filter

@andrassy

24 September 2015 @ Elastic London Meetup

THE FILTER

IN MEDIA AND IN RETAIL, WE AIM TO UNDERSTAND…

• HOW STUFF RELATES TO OTHER STUFF

• HOW PEOPLE RELATE TO STUFF

GIVING USERS THE RIGHT “STUFF” AT THE RIGHT TIME…

• ALTERNATIVE PRODUCTS

• COHERENT PERSONALISED PLAYLISTS

• PRODUCTS YOU MIGHT LIKE

• CONTENT RELATED TO THIS PRODUCT

• RELEVANT NEWS

• ….AND MANY MORE

OUR HERITAGE

“The Filter is like a zen master, who knows me, knows what I am interested in, knows what’s out there and gives me what is relevant at the time that I really want it in the most appropriate way.”

OUR CHALLENGES

PERSONALISED IS HARD…

• READ SCALABILITY

• WRITE SCALABILITY (REALTIME PERSONALISATION FOR THE INDIVIDUAL)

• AVAILABILITY / FAULT TOLERANCE

MACHINE LEARNING IS HARD…

• DATA HUNGRY

• VOLUME – VELOCITY – VARIETY

• ML PROCESSES ARE RESOURCE INTENSIVE

MULTI-TENANCY IS HARD

• EVERY CATALOGUE IS UNIQUE / DIFFERENT

OUR DATA JOURNEY (pre Spark)

2004 – MS SQL

2011 – MS SQL + MONGODB

2012 – ELASTIC + MS SQL

ELASTIC good….

• STRUCTURED DATA

• UNSTRUCTURED DATA

• TIME-SERIES DATA

• READ SCALABILITY

• WRITE SCALABILITY

• SUPPORT FOR FAILURE

• EASY MANAGEMENT

• ….

ELASTIC not so good….

• DATA PROCESSING

• ETL

• BATCH

• STREAMS

• MACHINE LEARNING

• GRAPH

• ….

OUR DATA JOURNEY (continued)

2004 – MS SQL

2011 – MS SQL + MONGODB

2012 – ELASTIC + MS SQL

2014 – ELASTIC + SPARK

APACHE SPARK is…. A fast and general purpose engine for large-scale data processing

• SPEEDY – faster than Hadoop MapReduce (especially for in-memory / iterative workloads)

• EASY TO USE API – Java, Scala, Python, R

• SCALABLE – makes clustered operation transparent

• FLEXIBLE / POWERFUL COMPONENTS

• CORE

• SQL

• STREAMING

• MLLIB

• GRAPHX

• ***ELASTICSEARCH-SPARK***

https://spark.apache.org

CLIENT, MASTER, WORKER

• CLIENT (DRIVER) – submits a job to the MASTER

• MASTER (MANAGER) – co-ordinates the job with the WORKERS

• WORKERS – “do” the actual work/tasks (on RDDs)

• Ideally co-locate these on ES data nodes

• Workers manage local executors (per app)

https://spark.apache.org

RDD Resilient Distributed Dataset

• An IMMUTABLE collection of data elements

• PARTITIONED for distributed processing (think SHARD)

• RESILIENT for failure tolerance / recovery

IMMUTABLE + PARTITIONED

PARALLELIZABLE + SCALABLE
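
The three RDD properties above can be illustrated with a toy model. This is only a sketch using plain Scala collections, not the Spark API: the object name `ToyRddSketch`, its members and the partition size of 4 are all assumptions made for illustration.

```scala
// Toy model only: plain Scala collections standing in for an RDD.
// Nothing here is the Spark API; all names are hypothetical.
object ToyRddSketch {
  // IMMUTABLE: the source data is never modified after creation
  val data: Vector[Int] = (1 to 10).toVector

  // PARTITIONED: split into chunks of up to 4 elements (think: one chunk per shard)
  val partitions: Vector[Vector[Int]] = data.grouped(4).toVector

  // A transformation yields NEW partitions and leaves its input untouched.
  // Each partition is handled independently, so the work could run in parallel.
  val doubled: Vector[Vector[Int]] = partitions.map(_.map(_ * 2))

  // RESILIENT: a lost partition can be rebuilt from its parent partition
  // by re-applying the same transformation.
  def recompute(partitionIndex: Int): Vector[Int] =
    partitions(partitionIndex).map(_ * 2)
}
```

Because every partition is derived from an unchanged parent, losing one partition only requires recomputing that partition, not the whole data set.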

ELASTIC-SPARK Part of ELASTICSEARCH-HADOOP – connects Spark with Elastic

• Support for READ and WRITE

• Support for SQL

• RDD partitioning and ES shards work together…

• PARTITION PER SHARD

• PARTITION / SHARD LOCALITY PREFERRED

DEMOS

Using road safety data from https://data.gov.uk/

• Accidents

• Casualties

Scala language – expressive and natural fit for parallel workloads / RDD (but Java, Python etc. also available).

DEMO 1: LOAD CSV

Index from CSV – 1) Set up SparkContext

// Spark Core
import org.apache.spark.{SparkConf, SparkContext}
// Elasticsearch-Spark (adds saveToEs / esRDD to RDDs and the SparkContext)
import org.elasticsearch.spark._

// Configure and create a Spark context for our work
def InitialiseSparkContext: SparkContext = {
  val sparkConfig = new SparkConf()
    .setMaster("local[4]") // Run locally with 4 worker threads - can scale out easily later
    .setAppName("Accident data loader") // Friendly job/app name
    .set("es.index.auto.create", "true") // Optional job/app level ES settings
  new SparkContext(sparkConfig)
}

Index from CSV – 2) Load from file and prepare

// sc = the SparkContext created in step 1
// Read the CSV file (a distributed / parallel iterable set of file lines)
val csvRdd = sc.textFile(sourceFileName)
// Split and clean ALL the text file rows -> iterable set of Array[String]
// (naive split: assumes no quoted fields containing commas; limit -1 keeps trailing empty fields)
val headerAndRowsRdd = csvRdd.map(line => line.split(",", -1).map(_.trim))
// Get headers – first() returns a plain local Array[String], not an RDD; it is shipped to the workers inside the closures below
val headers = headerAndRowsRdd.first()
// Create a set of all data *except* the header row
val rowDataRdd = headerAndRowsRdd.filter(_(0) != headers(0))
// Zip together headers and data into an iterable set of Maps (e.g. key->value)
val finalMapsRdd = rowDataRdd.map(rowValues => headers.zip(rowValues).toMap)
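
To make the shape of the resulting Maps concrete, here is a minimal sketch of the same split / filter / zip steps on an in-memory sample, using plain Scala collections rather than an RDD. The sample lines and their values are invented for illustration (the real file has many more columns), and `CsvToMapsSketch` is a hypothetical name. The sketch passes a limit of -1 to `split` so that trailing empty fields are kept.

```scala
// Plain Scala stand-in for the RDD pipeline; sample data is made up.
object CsvToMapsSketch {
  // Hypothetical sample: a header row followed by two data rows
  val lines: Seq[String] = Seq(
    "Accident_Index,Number_of_Casualties,Speed_limit",
    "A0001, 1, 30",
    "A0002, 2, "
  )

  // Split on commas (limit -1 keeps trailing empty fields) and trim each value
  val headerAndRows: Seq[Array[String]] =
    lines.map(_.split(",", -1).map(_.trim))

  // The first row holds the column names
  val headers: Array[String] = headerAndRows.head

  // Drop the header row by comparing the first column with the first header
  val rows: Seq[Array[String]] = headerAndRows.filter(_(0) != headers(0))

  // One Map per row: column name -> value
  val maps: Seq[Map[String, String]] = rows.map(r => headers.zip(r).toMap)
}
```

Each element of `maps` corresponds to one document that would be indexed, with the column names as field names. All values are still strings at this point.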

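Index from CSV – 3) Save to ES

// Finally, save it - transformations are lazy, so the bulk of the work only executes at this line
// (the earlier first() call is also an action, but it only reads enough data to return one row)
finalMapsRdd.saveToEs(s"$destinationIndex/$destinationType")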
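
The "nothing runs until the save" behaviour can be sketched without Spark. The example below uses a plain Scala `Iterator`, whose `map` is also lazy, as a rough analogy for RDD transformations; it is not the Spark API, and `LazySketch` and its members are hypothetical names.

```scala
// Analogy only: a lazy Scala Iterator standing in for lazy RDD transformations.
object LazySketch {
  // Counts how many elements have actually been transformed so far
  var evaluated: Int = 0

  val source: Iterator[String] = Iterator("a", "b", "c")

  // Declaring the transformation does no work yet
  val transformed: Iterator[String] = source.map { s =>
    evaluated += 1
    s.toUpperCase
  }

  // Still zero here: nothing has been pulled through the pipeline
  val evaluatedBeforeAction: Int = evaluated

  // Consuming the iterator plays the role of the "action" (like saveToEs)
  val result: List[String] = transformed.toList

  // Now every element has been processed
  val evaluatedAfterAction: Int = evaluated
}
```

In Spark the same idea lets the engine see the whole chain of transformations before running it, so the steps can be planned per partition.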

DEMO 2: RE-INDEX A TYPE

Create target index and mappings as required and then…

sc.esRDD(s"$sourceIndex/$sourceType").saveToEs(s"$destIndex/$destType")

Or, alternatively, take more control…

// Load data
val sourceRdd = sc.esRDD(s"$sourceIndex/$sourceType")
// Save to ES, extracting parent ID from source data
sourceRdd.saveToEs(s"$destIndex/$destType", Map("es.mapping.parent" -> "Accident_Index"))

DEMO 3: SQL / JOIN

1 - Create a SQLContext from a SparkContext…

import org.apache.spark.sql.SQLContext
// sc = existing SparkContext
val sqlContext = new SQLContext(sc)

2 - Declare tables

sqlContext.sql(
  "CREATE TEMPORARY TABLE accident " +
  "USING org.elasticsearch.spark.sql " +
  s"OPTIONS (resource '${indexName}_reindex/accident', pushdown 'true')")

sqlContext.sql(
  "CREATE TEMPORARY TABLE casualty " +
  "USING org.elasticsearch.spark.sql " +
  s"OPTIONS (resource '${indexName}_reindex/casualty', pushdown 'true')")

DEMO 3: SQL / JOIN (continued)

3 – Query the tables into a DataFrame (effectively an RDD of rows with a schema)

val joinedDataFrame = sqlContext.sql(
  """SELECT
    | `1st_Road_Class`,
    | COUNT(a.Accident_Index) as count_Casualty,
    | AVG(c.Age_of_Casualty) as avg_Age_of_Casualty,
    | AVG(c.Casualty_Severity) as avg_Casualty_Severity
    |FROM accident a
    |INNER JOIN casualty c
    | ON c.Accident_Index = a.Accident_Index
    |WHERE c.Casualty_Severity < 3
    |GROUP BY `1st_Road_Class`
    |ORDER BY `1st_Road_Class`""".stripMargin
)

4 – Collect results on client (be careful – don’t collect HUGE datasets!)

// collect() pulls the final result data back from the workers to the client
joinedDataFrame.collect().foreach(println)
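
For readers less familiar with SQL, the join and aggregation above can be sketched over in-memory collections. This is a plain Scala illustration of the query's semantics, not Spark or the DataFrame API; the case classes, the object name `JoinSketch` and all sample values are invented for illustration.

```scala
// Plain Scala sketch of the JOIN / WHERE / GROUP BY / ORDER BY above.
// All names and sample rows are hypothetical.
object JoinSketch {
  final case class Accident(accidentIndex: String, roadClass: Int)
  final case class Casualty(accidentIndex: String, age: Int, severity: Int)
  final case class Summary(roadClass: Int, count: Int, avgAge: Double, avgSeverity: Double)

  val accidents: Seq[Accident] =
    Seq(Accident("A1", 3), Accident("A2", 3), Accident("A3", 6))

  val casualties: Seq[Casualty] = Seq(
    Casualty("A1", 20, 2),
    Casualty("A1", 40, 1),
    Casualty("A2", 30, 3), // removed by the severity filter below
    Casualty("A3", 50, 2)
  )

  // Lookup table for the join key (Accident_Index)
  val byIndex: Map[String, Accident] =
    accidents.map(a => a.accidentIndex -> a).toMap

  val summaries: Seq[Summary] =
    casualties
      .filter(_.severity < 3)                                       // WHERE
      .flatMap(c => byIndex.get(c.accidentIndex).map(a => (a, c)))  // INNER JOIN
      .groupBy { case (a, _) => a.roadClass }                       // GROUP BY
      .map { case (roadClass, pairs) =>
        val cs = pairs.map(_._2)
        Summary(
          roadClass,
          cs.size,                                   // COUNT
          cs.map(_.age).sum.toDouble / cs.size,      // AVG age
          cs.map(_.severity).sum.toDouble / cs.size  // AVG severity
        )
      }
      .toSeq
      .sortBy(_.roadClass)                                          // ORDER BY
}
```

With the sample rows above, `summaries` contains one row per road class: class 3 with two casualties, and class 6 with one.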

VERSIONS

• Elastic 1.7.2

• Apache Spark 1.5

• Scala 2.11.7

• ElasticSearch-Spark v2.2.0-m1

//SBT

libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.5.0")

libraryDependencies += ("org.apache.spark" %% "spark-sql" % "1.5.0")

libraryDependencies += ("org.elasticsearch" %% "elasticsearch-spark" % "2.2.0-m1")
