DIY Analytics with Apache Spark

Transcript
Page 1: DIY Analytics with Apache Spark

DIY ANALYTICS WITH APACHE SPARK

ADAM ROBERTS

London, 22nd June 2017: originally presented at Geecon

Page 2: DIY Analytics with Apache Spark

Important disclaimers

Copyright © 2017 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed "AS IS" without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer's responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.

Information within this presentation is accurate to the best of the author's knowledge as of the 4th of June 2017

Page 3: DIY Analytics with Apache Spark

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM's products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™, Global Business Services®, Global Technology Services®, Information on Demand, ILOG, LinuxONE™, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners; a current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Apache Spark, Apache Cassandra, Apache Hadoop, Apache Maven, Apache Kafka and any other Apache project mentioned here and the Apache product logos including the Spark logo are trademarks of The Apache Software Foundation.

Page 4: DIY Analytics with Apache Spark

● Showing you how to get started from scratch: going from "I've heard about Spark" to "I can use it for..."

● Worked examples aplenty: lots of code
● Not intended to be scientifically accurate! Sharing ideas
● Useful reference material
● Slides will be hosted

Stick around for...

Page 5: DIY Analytics with Apache Spark

✔ Doing stuff yourself (within your timeframe and rules)

✔ Findings can be subject to bias: yours don’t have to be

✔ Trust the data instead

Motivation!

Page 6: DIY Analytics with Apache Spark

✔ Finding aliens with the SETI Institute
✔ Genomics projects (GATK, Bluemix Genomics)
✔ IBM Watson services

Cool projects involving Spark

Page 7: DIY Analytics with Apache Spark

✔ Powerful machine(s)
✔ Apache Spark and a JDK
✔ Scala (recommended)
✔ Optional: a visualisation library for Spark output, e.g. Python with bokeh and pandas
✔ Optional but not covered here: a notebook bundled with Spark like Zeppelin, or use Jupyter

Your DIY analytics toolkit

Toolbox from Wikimedia: Tanemori derivative work: יוקיג'אנקי

Page 8: DIY Analytics with Apache Spark

Why listen to me?
● Worked on Apache Spark since 2014
● Helping IBM customers use Spark for the first time
● Resolving problems, educating service teams
● Testing on lots of IBM platforms since Spark 1.2: x86, Power, Z systems, all Java 8 deliverables...
● Fixing bugs in Spark/Java: contributing code and helping others to do so
● Working with performance tuning pros
● Code provided here has an emphasis on readability!

Page 9: DIY Analytics with Apache Spark

● What is it (why the hype)?
● How to answer questions with Spark
● Core Spark functions (the "bread and butter" stuff), plotting, correlations, machine learning
● Built-in utility functions to make our lives easier (labels, features, handling nulls)
● Examples using data from wearables: two years of activity

What I'll be covering today

Page 10: DIY Analytics with Apache Spark

Ask me later if you're interested in...
● Spark on IBM hardware
● IBM SDK for Java specifics
● Notebooks
● Spark using GPUs/GPUs from Java
● Performance tuning
● Comparison with other projects
● War stories fixing Spark/Java bugs

Page 11: DIY Analytics with Apache Spark

● You know how to write Java or Scala
● You've heard about Spark but never used it
● You have something to process!

What I assume...

Page 12: DIY Analytics with Apache Spark

This talk won’t make you a superhero!

Page 13: DIY Analytics with Apache Spark

● Know more about Spark – what it can/can't do
● Know more about machine learning in Spark
● Know that machine learning's still hard but in different ways

But you will...

Page 14: DIY Analytics with Apache Spark

Open source project (the most active for big data) offering distributed...

● Machine learning
● Graph processing
● Core operations (map, reduce, joins)
● SQL syntax with DataFrames/Datasets

Page 15: DIY Analytics with Apache Spark

✔ Build it yourself from source (requiring Git, Maven, a JDK) or
✔ Download a community built binary or
✔ Download our free Spark development package (includes IBM's SDK for Java)

Page 16: DIY Analytics with Apache Spark

Things you can process...
● File formats you could use with Hadoop
● Anything there's a Spark package for
● json, csv, parquet...

Things you can use with it...
● Kafka for streaming
● Hive tables
● Cassandra as a database
● Hadoop (using HDFS with Spark)
● DB2!

Page 17: DIY Analytics with Apache Spark

“What’s so good about it then?”

Page 18: DIY Analytics with Apache Spark

● Offers scalability and resiliency
● Auto-compression, fast serialisation, caching
● Python, R, Scala and Java APIs: eligible for Java based optimisations
● Distributed machine learning!

Page 19: DIY Analytics with Apache Spark

“Why isn’t everyone using it?”

Page 20: DIY Analytics with Apache Spark

● Can you get away with using spreadsheet software?
● Have you really got a large amount of data?
● Data preparation is very important!
  How will you properly handle negative, null, or otherwise strange values in your data?
● Will you benefit from massive concurrency?
● Is the data in a format you can work with?
● Does it need transforming first (and is it worth it)?

Not every problem is a Spark one!

Page 21: DIY Analytics with Apache Spark

● Not really real-time streaming ("micro-batching")
● Debugging in a largely distributed system with many moving parts can be tough
● Security: not really locked down out of the box (extra steps required by knowledgeable users: whole disk encryption or using other projects, SSL config to do...)

Implementation details...

Page 22: DIY Analytics with Apache Spark

Getting something up and running quickly

Page 23: DIY Analytics with Apache Spark

Run any Spark example in "local mode" first (from the "spark" directory):

bin/run-example org.apache.spark.examples.SparkPi 100

Then run it on a cluster you can set up yourself:
Add hostnames in conf/slaves
sbin/start-all.sh
bin/run-example --master <your_master:7077> ...
Check for running Java processes: looking for workers/executors coming and going
Spark UI (default port 8080 on the master)

See: http://spark.apache.org/docs/latest/spark-standalone.html

lib is only with the IBM package

Running something simple

Page 24: DIY Analytics with Apache Spark

And you can use Spark's Java/Scala APIs with bin/spark-shell (a REPL!)

bin/spark-submit

java/scala -cp "$SPARK_HOME/jars/*"

PySpark not covered in this presentation – but fun to experiment with and lots of good docs online for you

Page 25: DIY Analytics with Apache Spark

Increasing the number of threads available for Spark processing in local mode (5.2 GB text file) – does it actually work?

--master local[1] real 3m45.328s

--master local[4] real 1m31.889s

time {
  echo "--master local[1]"
  $SPARK_HOME/bin/spark-submit --master local[1] --class MyClass WordCount.jar
}

time {
  echo "--master local[4]"
  $SPARK_HOME/bin/spark-submit --master local[4] --class MyClass WordCount.jar
}
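For reference, a minimal sketch of what the MyClass word-count job packaged into WordCount.jar might look like (the class name, jar and input path are assumptions, not taken from the slides):

import org.apache.spark.sql.SparkSession

// Hedged sketch: a plain word count, suitable for packaging and running
// via spark-submit as above. "big_text_file.txt" is a placeholder path.
object MyClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val counts = spark.sparkContext.textFile("big_text_file.txt").
      flatMap(_.split("\\s+")).
      map(word => (word, 1)).
      reduceByKey(_ + _)
    counts.take(10).foreach(println)
    spark.stop()
  }
}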

Page 26: DIY Analytics with Apache Spark

“Anything else good about Spark?”

Page 27: DIY Analytics with Apache Spark

● Resiliency by replication and lineage tracking
● Distribution of processing via (potentially many) workers that can spawn (potentially many) executors
● Caching! Keep data in memory, reuse later
● Versatility and interoperability
  APIs include Spark core, ML, DataFrames and Datasets, Streaming and GraphX...
● Read up on RDDs and ML material by Andrew Ng, Spark Summit videos, deep dives on Catalyst/Tungsten if you want to really get stuck in! This is a DIY talk

Page 28: DIY Analytics with Apache Spark

Recap – we know what it is now... and want to do some analytics!

Page 29: DIY Analytics with Apache Spark

● Data I’ll process here is for educational purposes only: road_accidents.csv

● Kaggle is a good place to practice – lots of datasets available for you

● Data I'm using is licensed under the Open Government License for public sector information

Page 30: DIY Analytics with Apache Spark

"accident_index","vehicle_reference","vehicle_type","towing_and_articulation","vehicle_manoeuvre","vehicle_location”,restricted_lane","junction_location","skidding_and_overturning","hit_object_in_carriageway","vehicle_leaving_carriageway","hit_object_off_carriageway","1st_point_of_impact","was_vehicle_left_hand_drive?","journey_purpose_of_driver","sex_of_driver","age_of_driver","age_band_of_driver","engine_capacity_(cc)","propulsion_code","age_of_vehicle","driver_imd_decile","driver_home_area_type","vehicle_imd_decile","NUmber_of_Casualities_unique_to_accident_ind ex","No_of_Vehicles_involved_unique_to_accident_index","location_easting_osgr","location_north ing_osgr","longitude","latitude","police_force","accident_severity","number_of_vehicles","number_of_casualties","date","day_of_week","time","local_authority_(district)","local_authority_(highway)","1st_road_class","1st_road_number","road_type","speed_limit","junction_detail","junction_control"," 2nd_road_class","2nd_road_number","pedestrian_crossing-human_control","pedestrian_crossing-physical_facilities","light_conditions","weather_conditions","road_surface_conditions","special_conditions_at_site","carriageway_hazards","urban_or_rural_area","did_police_officer_attend_scene_of_accident","lsoa_of_accident_location","casualty_reference","casualty_class","sex_of_casualty","age_of_casualty","age_band_of_casualty","casualty_severity","pedestrian_location","pedestrian_movement","car_passenger","bus_or_coach_passenger","pedestrian_road_maintenance_worker","casualty_type","casualty_home_area_type","casualty_imd_decile"

Features of the data (“columns”)

Page 31: DIY Analytics with Apache Spark

"201506E098757",2,9,0,18,0,8,0,0,0,0,3,1,6,1,45,7,1794,1,11,-1,1,-1,1,2,384980,394830,-2.227629,53.450014,6,3,2,1,"42250",2,1899-12-3012:56:00,102,"E08000003",5,0,6,30,3,4,6,0,0,0,1,1,1,0,0,1,2,"E01005288",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA, NA,NA

"201506E098766",1,9,0,9,0,8,0,0,0,0,4,1,6,2,25,5,1582, 2,1,-1,-1,-1,1,2,383870,394420,-2.244322,53.446296,6,3,2,1,"14/03/2015",7,1899-12-3015:55:00,102,"E08000003",3,5103,3,40,6,2,5,0,0,5,1,1,1

,0,0,1,1,"E01005178",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA, NA,NA,NA

Values (“rows”)

Page 32: DIY Analytics with Apache Spark

Spark way to figure this out?
groupBy* vehicle_type
sort** the results on count
vehicle_type maps to a code

First place: car
Distant second: pedal bike
Close third: van/goods HGV <= 3.5 T
Distant last: electric motorcycle

Type of vehicle involved in the most accidents?
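A minimal sketch of that approach, assuming allAccidents is the DataFrame loaded from road_accidents.csv as shown on a later slide:

import org.apache.spark.sql.functions.desc

// Hedged sketch: count accidents per vehicle_type code (*) and sort the
// result on that count (**), biggest first.
val accidentsByVehicleType = allAccidents.groupBy("vehicle_type").count().sort(desc("count"))
accidentsByVehicleType.show()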

Page 33: DIY Analytics with Apache Spark

Different column name this time, weather_conditions maps to a code again

Spark way...
groupBy* weather_conditions
sort** the results on count

First place: fine with no high winds
Second: raining, no high winds
Distant third: fine, with high winds
Distant last: snowing, high winds

What weather should I be avoiding?

Page 34: DIY Analytics with Apache Spark

Spark way...
groupBy* manoeuvre
sort** the results on count
manoeuvre maps to a code

First place: going ahead (!)
Distant second: turning right
Distant third: slowing or stopping
Last: reversing

Which manoeuvres should I be careful with?

Page 35: DIY Analytics with Apache Spark

“Why * and **?”

org.apache.spark functions that can run in a distributed manner

Page 36: DIY Analytics with Apache Spark

Spark code example – I'm using Scala
● Forced mutability consideration (val or var)
● Not mandatory to declare types (or "return ...")
● Check out "Scala for the Intrigued" on YouTube
● JVM based

Scala main method I'll be using:

object AccidentsExample {

  def main(args: Array[String]): Unit = {

  }

}

Which age group gets in the most accidents?

Page 37: DIY Analytics with Apache Spark

Spark entrypoint:

val session = SparkSession.builder().appName("Accidents").master("local[*]")

Creating a DataFrame: the API we'll use to interact with data as though it's in an SQL table

val sqlContext = session.getOrCreate().sqlContext

val allAccidents = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  load(myHome + "/datasets/road_accidents.csv")

allAccidents.show would give us a table like...

accident_index   vehicle_reference   vehicle_type   towing_and_articulation
201506E098757    2                   9              0
201506E098766    1                   9              0

Page 38: DIY Analytics with Apache Spark

Group our data and save the result

...

val myAgeDF = groupCountSortAndShow(allAccidents, "age_of_casualty", true)

myAgeDF.coalesce(1). write.option("header", "true"). format("csv"). save("victims")

Runtime.getRuntime().exec("python plot_me.py" )

def groupCountSortAndShow(df: DataFrame, columnName: String, toShow: Boolean): DataFrame = {
  val ourSortedData = df.groupBy(columnName).count().sort("count")
  if (toShow) ourSortedData.show()
  ourSortedData
}

Page 39: DIY Analytics with Apache Spark

"Hold on... what's that getRuntime().exec stuff?!"

Page 40: DIY Analytics with Apache Spark

It's calling my Python code to plot the CSV file:

import glob, os, pandas
from bokeh.plotting import figure, output_file, show

path = r'victims'

all_files = glob.glob(os.path.join(path, "*.csv"))

df_from_each_file = (pandas.read_csv(f) for f in all_files)

df = pandas.concat(df_from_each_file, ignore_index=True)

plot = figure(plot_width=640,plot_height=640,title='Accident victims by age',

x_axis_label='Age of victim', y_axis_label='How many')

plot.title.text_font_size = '16pt'

plot.xaxis.axis_label_text_font_size = '16pt'

plot.yaxis.axis_label_text_font_size = '16pt'

plot.scatter(x=df.age_of_casualty, y=df['count'])

output_file('victims.html')
show(plot)

Page 41: DIY Analytics with Apache Spark

Bokeh gives us a graph like this

Page 42: DIY Analytics with Apache Spark

“What else can I do?”

Page 43: DIY Analytics with Apache Spark

You've got some JSON files...

"Best doom metal band please"

{"id":"2","name":"Louder Bill","average_rating":"4.1","genre":"ambient"}
{"id":"3","name":"Prey Fury","average_rating":"2","genre":"pop"}
{"id":"4","name":"Unbranded Newsroom","average_rating":"4","genre":"rap"}
{"id":"5","name":"Bugle Infantry","average_rating":"5","genre":"doom_metal"}
{"id":"1","name":"Into Latch","average_rating":"4.9","genre":"doom_metal"}

import org.apache.spark.sql.functions._

val bandsDF = sqlContext.read.json(myHome + "/datasets/bands.json")
bandsDF.createOrReplaceTempView("bands")

sqlContext.sql("SELECT name, average_rating from bands WHERE " +
  "genre == 'doom_metal'").sort(desc("average_rating")).show(1)

+--------------+--------------+
|          name|average_rating|
+--------------+--------------+
|Bugle Infantry|             5|
+--------------+--------------+
only showing top 1 row

Randomly generated band names as of May the 18th 2017, zero affiliation on my behalf or IBM's for any of these names... entirely coincidental if they do exist

Page 44: DIY Analytics with Apache Spark

"Great, but you mentioned some data collected with wearables and machine learning!"

Page 45: DIY Analytics with Apache Spark

Anonymised data gathered from Automatic, Apple Health, Withings, Jawbone Up

● Car journeys
● Sleeping activity (start and end time)
● Daytime activity (calories consumed, steps taken)
● Weight and heart rate
● Several CSV files
● Anonymised by subject gatherer before uploading anywhere! Nothing identifiable

Page 46: DIY Analytics with Apache Spark

Exploring the datasets: driving activity

val autoData = sqlContext.read.format("com.databricks.spark.csv"). option("header", "true").option("inferSchema", "true").load(myHome + "/datasets/geecon/automatic.csv"). withColumnRenamed("End Location Name", "Location"). withColumnRenamed("End Time", "Time")

Page 47: DIY Analytics with Apache Spark

Checking our data is sensible...

val colsWeCareAbout = Array("Distance (mi)", "Duration (min)", "Fuel Cost (USD)")

for (col <- colsWeCareAbout) {
  summarise(autoData, col)
}

def summarise(df: DataFrame, columnName: String) {
  averageByCol(df, columnName)
  minByCol(df, columnName)
  maxByCol(df, columnName)
}

def averageByCol(df: DataFrame, columnName: String) {
  println("Printing the average " + columnName)
  df.agg(avg(df.col(columnName))).show()
}

def minByCol(df: DataFrame, columnName: String) {
  println("Printing the minimum " + columnName)
  df.agg(min(df.col(columnName))).show()
}

def maxByCol(df: DataFrame, columnName: String) {
  println("Printing the maximum " + columnName)
  df.agg(max(df.col(columnName))).show()
}

Average distance (in miles): 6.88, minimum: 0.01, maximum: 187.03
Average duration (in minutes): 14.87, minimum: 0.2, maximum: 186.92
Average fuel cost (in USD): 0.58, minimum: 0.0, maximum: 14.35
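As an aside, Spark's built-in DataFrame describe() gives count, mean, stddev, min and max in a single call; a minimal alternative sketch to the helper functions above, assuming the same autoData DataFrame:

// Hedged alternative: summary statistics for the three columns in one show()
autoData.describe("Distance (mi)", "Duration (min)", "Fuel Cost (USD)").show()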

Page 48: DIY Analytics with Apache Spark

Looks OK - what’s the rate of Mr X visiting a certain place? Got a favourite gym day?

Slacking on certain days?
● Using Spark to determine the chance of the subject being there
● Timestamps (the "Time" column) need to become days of the week instead
● The start of a common theme: data preparation!

Page 49: DIY Analytics with Apache Spark

Explore the data first

autoData.show() gives rows like this (wide table, abridged):

|Vehicle|Start Location Name|Start Time|Location|Time|Distance (mi)|Duration (min)|Fuel Cost (USD)|Average MPG|Fuel Volume (gal)|Hard Accelerations|Hard Brakes|Duration Over 70 mph (secs)|Duration Over 75 mph (secs)|Duration Over 80 mph (secs)|Start Location Accuracy (meters)|End Location Accuracy (meters)|Tags|...

|2005 Nissan Sentra|PokeStop 12|4/3/2016 15:06|PokeStop 12|4/3/2016 15:07|0.27|1.52|0.04|13.64|0.03|0|0|0|0|0|5.0|5.0|null|
|2005 Nissan Sentra|PokeStop 12|4/3/2016 15:17|PokeStop 12|4/3/2016 15:18|0.1|0.71|0.01|17.64|0.0|0|0|0|0|0|5.0|5.0|null|

Page 50: DIY Analytics with Apache Spark

Timestamp fun: 4/03/2016 15:06 is no good!

val preparedAutoData = sqlContext.sql(
  "SELECT TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)) as Date, Location, " +
  "date_format(TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)), 'EEEE') as Day " +
  "FROM auto_data")

preparedAutoData.show()

+----------+-----------+------+
|      Date|   Location|   Day|
+----------+-----------+------+
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|   Michaels|Sunday|
+----------+-----------+------+
...
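The same conversion can be done without SQL strings; a minimal DataFrame-API sketch, assuming the autoData DataFrame from earlier (preparedAutoData2 is an illustrative name, not from the slides):

import org.apache.spark.sql.functions.{col, date_format, to_date, unix_timestamp}

// Hedged equivalent of the SQL above: parse the "Time" string, keep only the
// date part and derive the day name with the 'EEEE' pattern.
val parsedDate = to_date(unix_timestamp(col("Time"), "MM/dd/yyyy").cast("timestamp"))
val preparedAutoData2 = autoData.
  withColumn("Date", parsedDate).
  withColumn("Day", date_format(parsedDate, "EEEE")).
  select("Date", "Location", "Day")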

Page 51: DIY Analytics with Apache Spark

Scala function: give us all of the rows where the day is what we specified

def printChanceLocationOnDay(sqlContext: SQLContext, day: String, location: String) {

  val allDatesAndDaysLogged = sqlContext.sql(
    "SELECT Date, Day " +
    "FROM prepared_auto_data " +
    "WHERE Day = '" + day + "'").distinct()

  allDatesAndDaysLogged.show()

+----------+------+
|      Date|   Day|
+----------+------+
|2016-10-17|Monday|
|2016-10-24|Monday|
|2016-04-25|Monday|
|2017-03-27|Monday|
|2016-08-15|Monday|
+----------+------+
...

Page 52: DIY Analytics with Apache Spark

Rows where the location and day match our query (passed in as parameters):

  val visits = sqlContext.sql("SELECT * FROM prepared_auto_data " +
    "WHERE Location = '" + location + "' AND Day = '" + day + "'")

  visits.show()

+----------+--------+------+
|      Date|Location|   Day|
+----------+--------+------+
|2016-04-04|     Gym|Monday|
|2016-11-14|     Gym|Monday|
|2017-01-09|     Gym|Monday|
|2017-02-06|     Gym|Monday|
+----------+--------+------+

  var rate = Math.floor(
    (Double.valueOf(visits.count()) /
     Double.valueOf(allDatesAndDaysLogged.count())) * 100)

  println(rate + "% rate of being at the location '" + location + "' on " + day +
    ", activity logged for " + allDatesAndDaysLogged.count() + " " + day + "s")
}

Page 53: DIY Analytics with Apache Spark

● 7% rate of being at the location 'Gym' on Monday, activity logged for 51 Mondays
● 1% rate of being at the location 'Gym' on Tuesday, activity logged for 51 Tuesdays
● 2% rate of being at the location 'Gym' on Wednesday, activity logged for 49 Wednesdays
● 6% rate of being at the location 'Gym' on Thursday, activity logged for 47 Thursdays
● 7% rate of being at the location 'Gym' on Saturday, activity logged for 41 Saturdays
● 9% rate of being at the location 'Gym' on Sunday, activity logged for 41 Sundays

val days = Array("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

for (day <- days) {
  printChanceLocationOnDay(sqlContext, day, "Gym")
}

Page 54: DIY Analytics with Apache Spark

Which feature(s) are closely related to another - e.g. the time spent asleep?
Dataset has these features from Jawbone

● s_duration (the sleep time as well...)
● m_active_time
● m_calories
● m_distance
● m_steps
● m_total_calories
● n_bedtime (hmm)
● n_awake_time

How about correlations?

Page 55: DIY Analytics with Apache Spark

Very strong positive correlation for n_bedtime and s_asleep_time

Correlation between goal_body_weight and s_asleep time: -0.02

val shouldBeLow = sleepData.stat.corr("goal_body_weight", "s_duration")

println("Correlation between goal body weight and sleep duration: " + shouldBeLow)

val compareToCol = "s_duration"
for (col <- sleepData.columns) {
  if (!col.equals(compareToCol)) { // don't compare to itself...
    val corr = sleepData.stat.corr(col, compareToCol)
    if (corr > 0.8) {
      println("Very strong positive correlation for " + col + " and " + compareToCol)
    } else if (corr >= 0.5) {
      println("Positive correlation for " + col + " and " + compareToCol)
    }
  }
}

And something we know isn’t related?

Page 56: DIY Analytics with Apache Spark

“...can Spark help me to get a good sleep?”

Page 57: DIY Analytics with Apache Spark

Need to define a good sleep first
8 hours for this test subject

If duration is > 8 hours, good sleep = true, else false

I'm using 1 for true and 0 for false
We will label this data soon so remember this

Then we'll determine the most influential features on the value being true or false. This can reveal the interesting stuff!

Page 58: DIY Analytics with Apache Spark

Sanity check first: any good sleeps for Mr X?

// Don't care if the sleep duration wasn't even recorded or it's 0
val onlyDurations = sleepData.select("s_duration")
val onlyRecordedSleeps = onlyDurations.filter($"s_duration" > 0)

val NUM_HOURS = 8
val THRESHOLD = 60 * 60 * NUM_HOURS
val onlyGoodSleeps = onlyDurations.filter($"s_duration" >= THRESHOLD)

println("Found " + onlyRecordedSleeps.count() + " valid recorded " +
  "sleep times and " + onlyGoodSleeps.count() + " were " + NUM_HOURS +
  " or more hours in duration")

Found 538 valid recorded sleep times and 129 were 8 or more hours in duration

Page 59: DIY Analytics with Apache Spark

We will use machine learning: but first...
1) What do we want to find out?
   Main contributing factors to a good sleep
2) Pick an algorithm
3) Prepare the data
4) Separate into training and test data
5) Build a model with the training data (in parallel using Spark!)
6) Use that model on the test data
7) Evaluate the model
8) Experiment with parameters until reasonably accurate, e.g. N iterations

(A minimal code sketch of steps 4 to 7 follows below.)
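A minimal sketch of steps 4 to 7, assuming a prepared DataFrame (called labelledAndFeaturised here, a placeholder name) that already has "label" and "features" columns; the real preparation work is what the next slides cover:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Step 4: separate into training and test data
val Array(training, test) = labelledAndFeaturised.randomSplit(Array(0.7, 0.3))

// Step 5: build a model with the training data
val model = new NaiveBayes().fit(training)

// Step 6: use that model on the test data
val predictions = model.transform(test)

// Step 7: evaluate the model
val evaluator = new MulticlassClassificationEvaluator().
  setLabelCol("label").
  setPredictionCol("prediction").
  setMetricName("accuracy")
println("Accuracy = " + evaluator.evaluate(predictions))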

Page 60: DIY Analytics with Apache Spark

Clustering algorithms such as K-means (unsupervised learning (no labels, cheap))
● Produce n clusters from data to determine which cluster a new item can be categorised as
● Identify anomalies: transaction fraud, erroneous data

Recommendation algorithms such as Alternating Least Squares
● Movie recommendations on Netflix?
● Recommended purchases on Amazon?
● Similar songs with Spotify?
● Recommended videos on YouTube?

Classification algorithms such as logistic regression
● Create a model that we can use to predict where to plot the next item in a sequence (above or below our line of best fit)
● Healthcare: predict adverse drug reactions based on known interactions with similar drugs
● Spam filter (binomial classification)
● Naive Bayes

Which algorithms might be of use?

Page 61: DIY Analytics with Apache Spark

What does "Naive Bayes" have to do with my sleep quality?
Using evidence provided, guess what a label will be (1 or 0) for us: easy to use with some training data

0 = the label (category 0 or 1), e.g. 0 = low scoring athlete, 1 = high scoring
1:x = the score for a sporting event 1
2:x = the score for a sporting event 2
3:x = the score for a sporting event 3

bayes_data.txt (libSVM format)
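For illustration only (these numbers are made up, not from the real file), a few lines in libSVM format look like this: the label comes first, then index:value pairs for the features described above:

0 1:1.2 2:0.8 3:1.1
1 1:9.5 2:8.7 3:9.0
0 1:2.1 2:1.4 3:0.6
1 1:8.8 2:9.9 3:8.2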

Page 62: DIY Analytics with Apache Spark

Read it in, split it, fit it, transform and evaluate – all on one slide with Spark!

val bayesData = sqlContext.read.format("libsvm").load("bayes_data.txt")

val Array(trainingData, testData) = bayesData.randomSplit(Array(0.7, 0.3))

val model = new NaiveBayes().fit(trainingData)

val predictions = model.transform(testData)

val evaluator = new MulticlassClassificationEvaluator().
  setLabelCol("label").
  setPredictionCol("prediction").
  setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy = " + accuracy)

Test set accuracy = 0.82

https://spark.apache.org/docs/2.1.0/mllib-naive-bayes.html

"Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to the training data, it computes the conditional probability distribution of each feature given label, and then it applies Bayes' theorem to compute the conditional probability distribution of label given an observation and use it for prediction."

Page 63: DIY Analytics with Apache Spark

Naive Bayes correctly classifies the data (giving it the right labels)

Feed some new data in for the model...

Page 64: DIY Analytics with Apache Spark

“Can I just use Naive Bayes on all of the sleep data?”

Page 65: DIY Analytics with Apache Spark

1) didn't label each row in the DataFrame yet
2) Naive Bayes can't handle our data in the current form
3) too many useless features

Page 66: DIY Analytics with Apache Spark

Possibilities – bear in mind that DataFrames are immutable, can't modify elements directly...

1) Spark has a .map function, how about that?
"map is a transformation that passes each dataset element through a function and returns a new RDD representing the results" - http://spark.apache.org/docs/latest/programming-guide.html
● Removes all other columns in my case... (new DataFrame with just the labels!)

2) Running a user defined function on each row? (a sketch follows below)
● Maybe, but can Spark's internal SQL optimiser "Catalyst" see and optimise it? Probably slow

Labelling each row according to our “good sleep” criteria
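For completeness, a minimal sketch of what option 2 (a user defined function) could look like, assuming the sleepData DataFrame from earlier, that s_duration is an integer column, and the THRESHOLD value (8 hours in seconds) from the sanity-check slide; as noted above, Catalyst can't see inside the UDF, so the withColumn/when approach on the next slide is the better choice:

import org.apache.spark.sql.functions.{col, udf}

// Hedged sketch of the UDF option: label each row 1 (good sleep) or 0 (bad).
// labelGoodSleep and labelledViaUdf are illustrative names, not from the slides.
val labelGoodSleep = udf((duration: Int) => if (duration > THRESHOLD) 1 else 0)
val labelledViaUdf = sleepData.withColumn("good_sleep", labelGoodSleep(col("s_duration")))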

Page 67: DIY Analytics with Apache Spark

Preparing the labels

Preparing the features is easier

val labelledSleepData = sleepData. withColumn("s_duration", when(col("s_duration") > THRESHOLD, 1). otherwise(0))

val assembler = new VectorAssembler().setInputCols(sleepData.columns).setOutputCol("features")

val preparedData = assembler.transform(labelledSleepData). withColumnRenamed("s_duration", "good_sleep")

"If duration is > 8 hours, good sleep = true, else false

I'm using 1 for true and 0 for false"

Page 68: DIY Analytics with Apache Spark

✔ Labelled data now

Page 69: DIY Analytics with Apache Spark

1) didn't label each row in the DataFrame yet
2) Naive Bayes can't handle our data in the current form
3) too many useless features

Page 70: DIY Analytics with Apache Spark

Trying to fit a model to the DataFrame now leads to...

Page 71: DIY Analytics with Apache Spark

s_asleep_time and n_bedtime (integers)

API docs: “Time user fell asleep. Seconds to/from midnight. If negative, subtract from midnight. If positive, add to midnight”

Solution in this example? Change to positives only

Add the number of seconds in a day to whatever s_asleep_time's value is. Think it through properly when you try this if you’re done experimenting and want something reliable to use!

The problem...

Page 72: DIY Analytics with Apache Spark

New DataFrame where negative values are handled

val preparedSleepAsLabel = preparedData.withColumnRenamed("good_sleep", "label")

val secondsInDay = 24 * 60 * 60

val toModel = preparedSleepAsLabel.
  withColumn("s_asleep_time", col("s_asleep_time") + secondsInDay).
  withColumn("s_bedtime", col("s_bedtime") + secondsInDay)

toModel.createOrReplaceTempView("to_model_table")

Page 73: DIY Analytics with Apache Spark

1) didn't label each row in the DataFrame yet
2) Naive Bayes can't handle our data in the current form
3) too many useless features

Page 74: DIY Analytics with Apache Spark

Reducing your "feature space"
Spark's ChiSqSelector algorithm will work here
We want labels and features to inspect

Page 75: DIY Analytics with Apache Spark

val selector = new ChiSqSelector().
  setNumTopFeatures(10).
  setFeaturesCol("features").
  setLabelCol("good_sleep").
  setOutputCol("selected_features")

val model = selector.fit(preparedData)

val topFeatureIndexes = model.selectedFeatures

for (i <- 0 until topFeatureIndexes.length) {
  // Get col names based on feature indexes
  println(preparedData.columns(topFeatureIndexes(i)))
}

Using ChiSq selector to get the top features

Feature selection tries to identify relevant features for use in model construction. It reduces the size of the feature space, which can improve both speed and statistical learning behavior. ChiSqSelector implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. It supports three selection methods: numTopFeatures, percentile, fpr: numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.

https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html#chisqselector

Page 76: DIY Analytics with Apache Spark

Transform values into a "features" column and only select columns we identified as influential

Earlier we did... toModel.createOrReplaceTempView("to_model_table")

val onlyInterestingColumns = sqlContext.sql("SELECT label, " + colNames.toString() + " FROM to_model_table")

val theAssembler = new VectorAssembler().setInputCols(onlyInterestingColumns.columns).setOutputCol("features")

val thePreparedData = theAssembler.transform(onlyInterestingColumns)
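colNames isn't defined on the slides; one way it could have been built from the ChiSqSelector output a couple of slides back is sketched below (the mkString formatting is an assumption):

// Hedged sketch: turn the selected feature indexes into a comma-separated
// list of column names for the SELECT above.
val colNames = topFeatureIndexes.map(i => preparedData.columns(i)).mkString(", ")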

Page 77: DIY Analytics with Apache Spark

Top ten influential features (most to least influential)

Feature          Description from Jawbone API docs
s_count          Number of primary sleep entries logged
s_awake_time     Time the user woke up
s_quality        Proprietary formula, don't know
s_asleep_time    Time when the user fell asleep
s_bedtime        Seconds the device is in sleep mode
s_deep           Seconds of main "sound sleep"
s_light          Seconds of "light sleep" during the sleep period
m_workout_time   Length of logged workouts in seconds
n_light          Seconds of light sleep during the nap
n_sound          Seconds of sound sleep during the nap

Page 78: DIY Analytics with Apache Spark

1) didn't label each row in the DataFrame yet
2) Naive Bayes can't handle our data in the current form
3) too many useless features

Page 79: DIY Analytics with Apache Spark

And after all that... we can generate predictions!

val Array(trainingSleepData, testSleepData) = thePreparedData.randomSplit(Array(0.7, 0.3))

val sleepModel = new NaiveBayes().fit(trainingSleepData)

val predictions = sleepModel.transform(testSleepData)

val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)println("Test set accuracy for labelled sleep data = " + accuracy)

Test set accuracy for labelled sleep data = 0.81 ...

Page 80: DIY Analytics with Apache Spark

Testing it with new data

val somethingNew = sqlContext.createDataFrame(Seq(

// Good sleep: high workout time, achieved a good amount of deep sleep, went to bed after midnight and woke at almost noon!

(0, Vectors.dense(0, 1, 42600, 100, 87659, 85436, 16138, 22142, 4073, 0)),

// Bad sleep, woke up early (5 AM), didn't get much of a deep sleep, didn't workout, bedtime 10.20 PM

(0, Vectors.dense(0, 0, 18925, 0, 80383, 80083, 6653, 17568, 0, 0))

)).toDF("label","features")

sleepModel.transform(somethingNew).show()

Page 81: DIY Analytics with Apache Spark

Sensible model created with outcomes we'd expect
Go to bed earlier, exercise more
I could have looked closer into removing the s_ variables so they're all m_, plus diet information; an exercise for the reader

Algorithms are producing these outcomes without domain specific knowledge

Page 82: DIY Analytics with Apache Spark

Last example: "does weighing more result in a higher heart rate?"
Will get the average of all the heart rates logged on a day when weight was measured

Lower heart rate day = weight was more?
Higher rate day = weight was less?

Maybe MLlib again? But all that preparation work...

How deeply involved with Spark do we usually need to get?

Page 83: DIY Analytics with Apache Spark

More data preparation needed, but there's a twist
Here I use data from two tables: weights, activities

Times are removed as we only care about dates, so the weights data becomes:

+----------+------+
|      Date|weight|
+----------+------+
|2017-04-09| 220.4|
|2017-04-08| 219.9|
|2017-04-07| 221.0|
+----------+------+
only showing top 3 rows

Page 84: DIY Analytics with Apache Spark

Include only heart beat readings when we have weight(s) measured: join on date used

+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.3|                  77.0|
|2017-02-09| 215.9|                  97.0|
|2017-02-09| 215.9|                 104.0|
|2017-02-09| 215.9|                  88.0|
+----------+------+----------------------+
...

Page 85: DIY Analytics with Apache Spark

Average the rate and weight readings by day

+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.7|                  77.0|
+----------+------+----------------------+
...

Should become this:

+----------+----------+--------------------------+
|      Date|avg weight|avg_heart_beats_per_minute|
+----------+----------+--------------------------+
|2017-02-13|     220.5|                        78|
+----------+----------+--------------------------+
...

Page 86: DIY Analytics with Apache Spark

DataFrame now looks like this...

+----------+---------------------------+-----------+
|      Date|avg(heart_beats_per_minute)|avg(weight)|
+----------+---------------------------+-----------+
|2016-04-25|                  85.933...|  196.46...|
|2017-01-06|                 93.8125...|      216.0|
|2016-05-03|                  83.647...|  198.35...|
|2016-07-26|                  84.411...|  192.69...|
+----------+---------------------------+-----------+

Something we can quickly plot!

Page 87: DIY Analytics with Apache Spark

Bokeh used again, no more analysis required

Page 88: DIY Analytics with Apache Spark

Used the same functions as earlier (groupBy, formatting dates) and also a join. Same plotting with different column names. No distinct correlation identified so moved on
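A minimal sketch of that join-then-average step, assuming two DataFrames named weights (Date, weight) and heartRates (Date, heart_beats_per_minute) with times already stripped from the dates (the names are assumptions, not from the slides):

import org.apache.spark.sql.functions.avg

// Hedged sketch: keep heart rate readings only for dates that also have a
// weight measurement, then average both per day - the table plotted above.
val joined = heartRates.join(weights, "Date")
val averagedByDay = joined.groupBy("Date").agg(avg("heart_beats_per_minute"), avg("weight"))
averagedByDay.show()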

Still lots of questions we could answer with Spark using this data
● Any impact on mpg when the driver weighs much less than before?
● Which fuel provider gives me the best mpg?
● Which visited places have a positive effect on subject's weight?

Page 89: DIY Analytics with Apache Spark

● Analytics doesn’t need to be complicated: Spark’s good for the heavy lifting

● Sometimes best to just plot as you go – saves plenty of time

● Other, harder things to worry about
  Writing a distributed machine learning algorithm shouldn't be one of them!

Page 90: DIY Analytics with Apache Spark

“Which tools can I use to answer my questions?”

This question becomes easier

Page 91: DIY Analytics with Apache Spark

Infrastructure when you're ready to scale beyond your laptop
● Setting up a huge HA cluster: a talk on its own
● Who sets up then maintains the machines? Automate it all?
● How many machines do you need? RAM/CPUs?
● Who ensures all software is up to date (CVEs?)
● Access control lists?
● Hosting costs/providers?
● Reliability, fault tolerance, backup procedures...

Still got to think about...

Page 92: DIY Analytics with Apache Spark

● Use GPUs to train models faster
● DeepLearning4J?
● Writing your own kernels/C/JNI code (or a Java API like CUDA4J/Aparapi?)

● Use RDMA to reduce network transfer times

● Zero copy: RoCE or InfiniBand?

● Tune the JDK, the OS, the hardware

● Continuously evaluate performance: Spark itself, use -Xhealthcenter, your own metrics, various libraries...

● Go tackle something huge – join the alien search

● Combine Spark Streaming with MLlib to gain insights fast

● More informed decision making

And if you want to really show off with Spark

Page 93: DIY Analytics with Apache Spark

● Know more about Spark: what it can and can’t do (new project ideas?)

● Know more about machine learning in Spark
● Know that machine learning's still hard but in different ways
  Data preparation, handling junk, knowing what to look for
  Getting the data in the first place
  Writing the algorithms to be used in Spark?

Recap – you should now...

Page 94: DIY Analytics with Apache Spark

● Built-in Spark functions are aplenty – try and stick to these
● You can plot your results by saving to a csv/json and using your existing favourite plotting libraries easily
● DataFrames (or Datasets) combined with ML = powerful APIs
● Filter your data – decide how to handle nulls!
● Pick and use a suitable ML algorithm
● Plot results

Points to take home...

Page 95: DIY Analytics with Apache Spark

Final points to consider...
Where would Spark fit in to your systems? A replacement or supplementary?
Give it a try with your own data and you might be surprised with the outcome
It's free and open source with a very active community!

Contact me directly: [email protected]

Page 96: DIY Analytics with Apache Spark

Questions?

Page 97: DIY Analytics with Apache Spark

● Automatic: log into the Automatic Dashboard https://dashboard.automatic.com/, on the bottom right, click export, choose what data you want to export (e.g. All)

● Fuelly: (obtained Gas Cubby), log into the Fuelly Dashboard http://www.fuelly.com/dashboard, select your vehicle in Your Garage, scroll down to vehicle logs, select Export Fuel-ups or Export Services, select duration of export

● Jawbone: sign into your account at https://jawbone.com/, click on your name on the top right, choose Settings, click on the Accounts tab, scroll down to Download UP Data, choose which year you'd like to download data for

How did I access the data to process?

Page 98: DIY Analytics with Apache Spark

● Withings: log into the Withings Dashboard https://healthmate.withings.com click Measurement table, click the tab corresponding to the data you want to export, click download. You can go here to download all data instead: https://account.withings.com/export/

● Apple: launch the Health app, navigate to the Health Data tab, select your account in the top right area of your screen, select Export Health Data

● Remember to remove any sensitive personal information before sharing/showing/storing said data elsewhere! I am dealing with “cleansed” datasets with no SPI