Adiós Hadoop, Hola Spark! T3chfest 2015


@dhiguero

Daniel Higuero

Agenda

•  Introduction
•  Spark
   §  Basic concepts
   §  Ecosystem

VIEWER DISCRETION IS ADVISED

All elephants are innocent until proven guilty in a court of development

Opinions expressed are solely my own and do not express the views or opinions of my employer.

Introduction

Timeline (2002-2015)

Google GFS paper
Google MapReduce paper
Hadoop: 1 TB, 910 nodes, < 4 min
HBase
Hive
alpha-0.1
Spark 0.7
Hadoop: 103 TB, 2100 nodes, 72 min
Spark: 100 TB, 206 nodes, 23 min
Spark 1.2+

#t3chfest2015

Introduction

o  What is Spark?
o  Parallel processing framework
o  History

https://spark.apache.org/

Apache Software Foundation

Map-reduce

o  A functional programming concept
o  Popularized by Google

(map 'list (lambda (x) (+ x 10)) '(1 2 3 4)) => (11 12 13 14)

(reduce #'+ '(1 2 3 4)) => 10
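The two Lisp expressions above have a direct counterpart in Scala, the language used for the Spark examples later in the talk. This is a minimal sketch on ordinary in-memory lists, not on a cluster; the names `MapReduceBasics`, `addTen` and `sum` are made up for illustration.

```scala
object MapReduceBasics {
  // Counterpart of (map 'list (lambda (x) (+ x 10)) '(1 2 3 4))
  def addTen(xs: List[Int]): List[Int] = xs.map(x => x + 10)

  // Counterpart of (reduce #'+ '(1 2 3 4)); assumes a non-empty list
  def sum(xs: List[Int]): Int = xs.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(addTen(List(1, 2, 3, 4))) // List(11, 12, 13, 14)
    println(sum(List(1, 2, 3, 4)))    // 10
  }
}
```

`map` applies a function to every element independently, which is what makes it easy to run in parallel; `reduce` combines the results pairwise.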

Jeff Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI (2004)


Map-Reduce

Input data → Map (×4) → Reduce (×3) → result

Map-Reduce

val wordCounts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms

flatMap → Array[String]: Apache, Spark, is, an, open-source, cluster, …

map → Array[(String, Int)]: (Apache, 1), (Spark, 1), (is, 1), …, (Spark, 1), (is, 1), …

reduceByKey → Array[(String, Int)]: (Apache, 1), (Spark, 2), (is, 2), …, (to, 4), (the, 1), …

Source: Wikipedia
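The three stages of the word count can be tried without a cluster. The sketch below mimics them with plain Scala collections; it is not Spark code. Since `reduceByKey` is a Spark operation, it is approximated here by `groupBy` followed by a per-key `reduce`, and `WordCountSketch` and `wordCounts` are hypothetical names.

```scala
object WordCountSketch {
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ")) // one entry per word
      .map(word => (word, 1))           // pair every word with a count of 1
      .groupBy(_._1)                    // stand-in for Spark's grouping by key
      .map { case (word, pairs) => word -> pairs.map(_._2).reduce(_ + _) }

  def main(args: Array[String]): Unit = {
    val counts = wordCounts(Seq("Spark is fast", "Spark is in memory"))
    println(counts("Spark")) // 2
    println(counts("fast"))  // 1
  }
}
```

In Spark the same chain runs on partitions spread across the cluster, and `reduceByKey` combines counts within each partition before moving data between nodes.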

Advantages of Spark

o  Greater flexibility in defining transformations
o  Less use of disk storage
o  Better use of memory
o  Fault tolerance
o  Community traction

Basic concepts

RDD

o  The basic abstraction in Spark
o  Holds the transformations to be applied to a dataset
   •  Immutable
   •  Lazy evaluation
   •  State can be recovered after a failure
   •  Control over persistence and partitioning
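The RDD properties listed above (immutable, lazily evaluated, recomputable) can be illustrated with a toy class. This is only a teaching model in plain Scala and is not how Spark implements RDDs; `ToyRDD` and `ToyRDDDemo` are invented names, and persistence and partitioning are left out.

```scala
// Toy model only: records how to compute the data instead of holding it.
final class ToyRDD[A](compute: () => Seq[A]) {
  // Transformations return a new ToyRDD and run nothing yet (lazy, immutable).
  def map[B](f: A => B): ToyRDD[B] = new ToyRDD(() => compute().map(f))
  def filter(p: A => Boolean): ToyRDD[A] = new ToyRDD(() => compute().filter(p))

  // Action: evaluates the whole recorded chain of transformations.
  def collect(): Seq[A] = compute()
}

object ToyRDDDemo {
  def main(args: Array[String]): Unit = {
    var reads = 0 // counts how often the source is actually read
    val source = new ToyRDD(() => { reads += 1; Seq(1, 2, 3, 4) })

    val result = source.map(_ + 10).filter(_ % 2 == 0)
    println(reads)            // 0: nothing has been computed so far
    println(result.collect()) // List(12, 14)
    println(reads)            // 1: the action triggered the computation
  }
}
```

Because each `ToyRDD` only remembers the steps that produce it, calling `collect()` again recomputes the result from the source. That loosely mirrors how a lost partition can be rebuilt from its recorded transformations.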

Ecosystem

Spark core engine

o  Provides the basic abstractions and handles scheduling

Diagram components: RDD, DAG scheduling, cluster manager, task scheduling, worker, threads, block manager

Spark Streaming

o  Turns a streaming source into a set of mini-batches
   •  Window definition
      §  Time-based

Window = 5

batch0 batch1 batch2 batch3 batch4 batch5 batch6 batch7

time
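The windowing idea can be sketched with plain Scala collections rather than the Spark Streaming API. The sketch assumes that the window of 5 is counted in batches and slides one batch at a time; the slide states neither detail. `WindowSketch` and `windows` are hypothetical names.

```scala
object WindowSketch {
  // Group consecutive mini-batches into overlapping windows of `size` batches.
  def windows[A](batches: Seq[A], size: Int): List[Seq[A]] =
    batches.sliding(size).toList

  def main(args: Array[String]): Unit = {
    val batches = (0 to 7).map(i => s"batch$i")
    // Prints one window per line, starting with batch0 .. batch4
    windows(batches, 5).foreach(w => println(w.mkString(" ")))
  }
}
```

With eight batches and a window of five this yields four windows. Each window is then processed like an ordinary batch job over the data it contains.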

MLlib

o  Machine learning library
o  Useful abstractions for computation
   o  Vectors, sparse matrices
o  Implementations of well-known algorithms
   o  Classification, regression, collaborative filtering and clustering

SparkSQL

o  SQL access layer for running operations on RDDs
o  DataFrame (formerly SchemaRDD)

val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")

// Columns are referenced through their DataFrame, e.g. people("age")
people.filter(people("age") > 30)
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))

© Databricks
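To see what the query above computes without a Spark installation, the same filter, join and grouping can be written against in-memory collections. Everything here is an assumption for illustration: the slide does not show the Parquet schemas, so `Person` and `Department` are invented record shapes, department ids are assumed unique, and counting rows per group is a guessed aggregation because the slide stops at `groupBy`.

```scala
object JoinSketch {
  // Hypothetical record shapes standing in for the two Parquet files.
  case class Person(name: String, age: Int, gender: String, deptId: Int)
  case class Department(id: Int, name: String)

  // Keep people over 30, join on deptId == id, count per (department, gender).
  def countByDeptAndGender(people: Seq[Person],
                           departments: Seq[Department]): Map[(String, String), Int] = {
    val deptNameById = departments.map(d => d.id -> d.name).toMap
    people
      .filter(_.age > 30)
      // inner join: people whose department is unknown are dropped
      .flatMap(p => deptNameById.get(p.deptId).map(dept => (dept, p.gender)))
      .groupBy(identity)
      .map { case (key, rows) => key -> rows.size }
  }

  def main(args: Array[String]): Unit = {
    val people = Seq(Person("Ana", 35, "F", 1), Person("Luis", 28, "M", 1),
                     Person("Eva", 41, "F", 2), Person("Juan", 50, "M", 1))
    val departments = Seq(Department(1, "Engineering"), Department(2, "Operations"))
    countByDeptAndGender(people, departments).foreach(println)
  }
}
```

The DataFrame version expresses the same steps declaratively, which leaves Spark free to optimize the execution plan.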


Getting started

$ wget http://www.apache.org/.../spark-1.2.0-bin-hadoop2.4.tgz
$ tar xvzf spark-1.2.0-bin-hadoop2.4.tgz
$ cd spark-1.2.0-bin-hadoop2.4
$ cp conf/spark-env.sh.template conf/spark-env.sh
$ ./bin/spark-shell

$ ./bin/spark-shell
…
15/02/09 15:47:50 INFO HttpServer: Starting HTTP Server
15/02/09 15:47:50 INFO Utils: Successfully started service 'HTTP class server' on port 60416.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
scala>

http://localhost:4040


WE ARE HIRING!

Java, Scala, Big Data, Spark, Hadoop, Cassandra, MongoDB, NoSQL, Ping pong, Nerf, Passion

BIG DATA CHILD'S PLAY

@dhiguero

Daniel Higuero

Acknowledgements: This work has been partially funded by the Spanish Ministry of Economy and Competitiveness under grant PTQ-13-05997
