Adios hadoop, Hola Spark! T3chfest 2015
VIEWER DISCRETION IS ADVISED
All elephants are innocent until proven guilty in a court of development
Opinions expressed are solely my own and do not express the views or opinions of my employer.
Timeline
#t3chfest2015
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
o Google GFS paper
o Google MapReduce paper
o Hive
o HBase
o Hadoop: 1 TB, 910 nodes, < 4 min
o Hadoop: 103 TB, 2100 nodes, 72 min
o Spark: 100 TB, 206 nodes, 23 min
o alpha-0.1, Spark 0.7, Spark 1.2+
Introduction
o What is Spark?
o Parallel processing framework
o History
https://spark.apache.org/
Apache Software Foundation
Map-reduce
o Functional programming concept
o Popularized by Google
(map 'list (lambda (x) (+ x 10)) '(1 2 3 4)) => (11 12 13 14)
(reduce #'+ '(1 2 3 4)) => 10
Jeff Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI (2004)
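The two Lisp expressions above carry over directly to Scala, the language used in the rest of this deck. The following is a minimal sketch on a local list, not Spark code; the object name MapReduceSketch is made up for illustration.

```scala
object MapReduceSketch {
  // Same input list as the Lisp example on the slide.
  val numbers: List[Int] = List(1, 2, 3, 4)

  // map: apply a function to every element, producing a new list.
  val mapped: List[Int] = numbers.map(x => x + 10)

  // reduce: combine all elements pairwise with a binary operator.
  val total: Int = numbers.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(mapped) // List(11, 12, 13, 14)
    println(total)  // 10
  }
}
```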
Map-Reduce
val wordCounts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms
Array[String]: Apache, Spark, is, an, open-source, cluster, …
Array[(String, Int)]: (Apache, 1), (Spark, 1), (is, 1), …, (Spark, 1), (is, 1), …
Array[(String, Int)]: (Apache, 1), (Spark, 2), (is, 2), …, (to, 4), (the, 1), …
Source: Wikipedia
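The three stages of the word count can be tried without a cluster. The sketch below reproduces them on plain Scala collections, under the assumption that groupBy followed by a per-key sum is an acceptable local stand-in for Spark's reduceByKey. The names LocalWordCount and wordCounts are hypothetical and are not part of Spark.

```scala
object LocalWordCount {
  // Hypothetical helper: counts words in in-memory lines with Scala collections only.
  def wordCounts(lines: Seq[String]): Map[String, Int] = {
    // Stage 1 (flatMap): split every line into words.
    val words: Seq[String] = lines.flatMap(line => line.split(" ").toSeq)
    // Stage 2 (map): pair each word with the count 1.
    val pairs: Seq[(String, Int)] = words.map(word => (word, 1))
    // Stage 3 (local stand-in for reduceByKey(_ + _)): group by word, sum the 1s.
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    val counts = wordCounts(Seq("Spark is fast", "Spark is fun"))
    println(counts("Spark")) // 2
    println(counts("fast"))  // 1
  }
}
```

On a real RDD the same three steps run in parallel across partitions; this local version only illustrates the data flow.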
Advantages of Spark
o Greater flexibility in defining transformations
o Less use of disk storage
o Better use of memory
o Fault tolerance
o Community traction
RDD
o Basic abstraction in Spark
o Contains the transformations to be applied to a data set
• Immutable
• Lazy evaluation
• State can be recovered after a failure
• Control over persistence and partitioning
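Lazy evaluation can be illustrated without Spark. The sketch below uses a Scala collection view as a rough analogy for an RDD: declaring the transformation does no work, and the function only runs when a result is requested. This is an analogy under that assumption, not how Spark implements RDDs; LazySketch and demo are hypothetical names.

```scala
object LazySketch {
  // Returns (evaluations before forcing, forced result, evaluations after forcing).
  def demo(): (Int, List[Int], Int) = {
    var evaluations = 0
    // Declaring the transformation on a view only records what should be done.
    val transformed = List(1, 2, 3, 4).view.map { x =>
      evaluations += 1
      x * 2
    }
    val before = evaluations          // still 0: nothing has been computed yet
    val result = transformed.toList   // forcing the view runs the function
    (before, result, evaluations)
  }

  def main(args: Array[String]): Unit = {
    println(demo()) // (0,List(2, 4, 6, 8),4)
  }
}
```

Forcing the view a second time would run the function again, which is loosely why control over persistence matters when a data set is reused.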
Spark core engine
o Provides the basic abstractions and handles scheduling
Diagram labels: RDD, DAG Scheduling, Cluster manager, Threads, Block manager, Task scheduling, Worker
Spark Streaming
o Turns a streaming source into a set of mini-batches
• Window definition
§ Time-based
Window = 5
batch0 batch1 batch2 batch3 batch4 batch5 batch6 batch7
Time
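The windowing idea can be sketched on a local sequence. The example below assumes that the window spans five mini-batches and advances one batch at a time; the slide does not state the slide interval, so that part is a guess. It uses plain Scala collections rather than the Spark Streaming API, and WindowSketch and windows are hypothetical names.

```scala
object WindowSketch {
  // Hypothetical helper: groups consecutive mini-batches into windows of
  // `size` batches, advancing one batch at a time.
  def windows[A](batches: Seq[A], size: Int): List[Seq[A]] =
    batches.sliding(size).toList

  def main(args: Array[String]): Unit = {
    // batch0 .. batch7, as drawn on the slide.
    val batches = (0 to 7).map(i => s"batch$i")
    // Prints 4 windows: batch0..batch4, batch1..batch5, batch2..batch6, batch3..batch7.
    windows(batches, 5).foreach(w => println(w.mkString(" ")))
  }
}
```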
MLlib
o Machine Learning library
o Useful abstractions for computation
o Vectors, sparse matrices
o Implementations of well-known algorithms
o Classification, regression, collaborative filtering, and clustering
SparkSQL
o SQL access layer for running operations on RDDs
o DataFrame (formerly SchemaRDD)
val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")
people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))
© databricks
First steps
$ wget http://www.apache.org/.../spark-1.2.0-bin-hadoop2.4.tgz
$ tar xvzf spark-1.2.0-bin-hadoop2.4.tgz
$ cd spark-1.2.0-bin-hadoop2.4
$ cp conf/spark-env.sh.template conf/spark-env.sh
$ ./bin/spark-shell
…
15/02/09 15:47:50 INFO HttpServer: Starting HTTP Server
15/02/09 15:47:50 INFO Utils: Successfully started service 'HTTP class server' on port 60416.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
scala>
http://localhost:4040
BIG DATA CHILD'S PLAY
@dhiguero
Daniel Higuero
Acknowledgements: This work has been partially funded by the Spanish Ministry of Economy and Competitiveness under grant PTQ-13-05997