Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June...

14
dsquare.de Salzburger Straße 27 83073 Stephanskirchen Tel.: 08031-234 1140 Mobil: 0172-1484 731 Email: [email protected] www.dsquare.de Rosenheim, 11. Juli 2016 Introduction to Apache Spark

Transcript of Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June...

Page 1: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

dsquare.deSalzburger Straße 27

83073 Stephanskirchen

Tel.: 08031-234 1140

Mobil: 0172-1484 731

Email: [email protected]

www.dsquare.de

Rosenheim, 11. Juli 2016

Introduction to Apache Spark

Page 2: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

Apache Spark according to Google Trends

2

Page 3: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

TDWI Conference sparked interest in Spark

3

Page 4: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

What is Apache Spark?

4

A cluster-based computing engine

Developed since 2012

Developed by students at UC Berkley

APIs for

Python

Java

R

Scala

Supports SQL, ML, Streaming Data, Graph processing

Faster than Hadoops Map-Reduce

Page 5: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

Timeline

5

Since late 1990s

APPLY Functions

In-memory

Single Process

Single Core

Not Scalable

Since 2007

Map-Reduce

Parallel Computing

Distributed File System

Linear Scalability

Since 2009

Directed acyclic graph

Parallel Computing

Distributed File System

Linear Scalability

Page 6: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

Map-reduce vs. Spark

6

Map-reduce

Directed acyclic graph

No writeback to HDFS necessary

Data passed to next processing step

Developer focused

Transformations available

Many APIs

In-Memory processing

RDD materialized in memory across

cluster

No need to reload from disc

Page 7: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

Spark is well suited for the needs of Data Scientists

7

Iterative application of

algorithms

Multiple passes over data sets

Reactive applications

Page 8: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

SparkR

Spark can unify an analytical environment

8

DB DBETL

Data Storage ETL using SQL,

SAS or else

Data Mart/local

Storage/Analytical

Environment

Retrieval

SQL/other

Language based

Data retrieval

Analysis

RE

PL

= R

ea

d E

va

lua

teP

rint L

oo

p (b

ac

k)

Page 9: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

RDD

9

Col1 Col2 Col3

Item 1 Item 4 Item 7

Item 2 Item 5 Item 8

Item 3 Item 6 Item 9

Data

Worker Nodes

Worker nodes:

They Cache the data and do the

(lazy) evaluation.

This could be an

RDD = Resiliant

Distributed Dataset

Page 10: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

Preliminaries using Spark in R-Studio

10

.libPaths(c(.libPaths(), '/opt/spark-1.6.1-bin-hadoop2.6/R/lib'))Sys.setenv(SPARK_HOME =

'/opt/spark-1.6.1-bin-hadoop2.6')

Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '/opt/spark-1.6.1-bin-hadoop2.6/bin', sep =

':'))

library(SparkR)

d.csv <- "com.databricks:spark-csv_2.11:1.4.0„

d.pg <- "org.postgresql:postgresql94-jdbc-9.4:1207-1"

sc <- sparkR.init(sparkPackages=c(d.csv))

sqlContext <- sparkRSQL.init(sc)

Page 11: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

Get data from Spark

11

s.df <- read.df(sqlContext,

source = "com.databricks.spark.csv",

path = "/var/data/server-sample.log",

delimiter = " ", header = "false")

cache(s.df) # Bring Spark data.frame to R

registerTempTable(s.df, "logs")

rc <- sql(sqlContext, "SELECT C0 AS ip, COUNT(*) AS n FROM logs GROUP BY C0 ORDER BY

COUNT(*) DESC")

Page 12: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

Analysing Sensorial Data

12

GPS

Pressure

Temperature

Acceleration

Hadoop Infrastructure

Hadoop

HDFS

Storm

Hive

Spark

Other Datasources

(Weather, Height

Data, GIS Data

Developing

predictive models

R&D Dashboards

Page 13: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

Further Sources

13

https://spark.apache.org/docs/latest/api/R/index.html

https://spark.apache.org/docs/latest/sparkr.html

Page 14: Introduction to Apache Spark - Meetupfiles.meetup.com/3576292/Dubravko Dulic SparkR June 2016.pdf · What is Apache Spark? 4 A cluster-based computing engine Developed since 2012

© dsquare.de (2007-2015):

Diese Präsentation ist urheberrechtlich geschützt. Alle Nutzungs- und Verwertungsrechte liegen exklusiv bei der dsquare.de. Jede urheberrechtlich relevante Nutzung oder

Verwertung dieser Präsentation oder von Teilen dieser Präsentation ist nur mit ausdrücklicher schriftlicher Zustimmung von dsquare.de zulässig. Dies gilt auch für die Weitergabe

dieser Präsentation oder von Teilen dieser Präsentation an Dritte, für die diese Präsentation nicht bestimmt ist.

Für Fragen stehen wir Ihnen gerne zur Verfügung!