SD CADD meeting 2016-08-30: Intro to Spark

Introduction to Apache Spark
Peter Rose [email protected]



Apache Spark is a fast and general engine for large-scale data processing
• In-memory processing

Successor of Hadoop (MapReduce)
• File-based processing

http://spark.apache.org/

Spark Ecosystem

Apache Spark works in parallel on
• Multicore laptop, desktop
• Single server
• Cluster (need cluster manager)

Map-Reduce Example

[Figure: data-flow diagram with the stages RDD<String>, RDD<String>, PairRDD<String,Integer>, PairRDD<String,Integer>; transformations labeled "one to many" and "one to one"]
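The stage types in the Map-Reduce example (strings, then strings, then string/integer pairs) are consistent with a word count, although the slide does not say so; that reading is an assumption. The sketch below mimics such a pipeline in plain Python rather than with the Spark API. The helper names `flat_map`, `map_to_pairs`, `reduce_by_key`, and `word_count`, and the input lines, are all made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def flat_map(func, items):
    """One to many: each input element may yield several output elements."""
    return list(chain.from_iterable(func(item) for item in items))


def map_to_pairs(items):
    """One to one: each word becomes exactly one (word, 1) pair."""
    return [(item, 1) for item in items]


def reduce_by_key(pairs):
    """Merge all values that share a key by summing them."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def word_count(lines):
    words = flat_map(str.split, lines)  # lines -> words (strings to strings)
    pairs = map_to_pairs(words)         # words -> (word, 1) pairs
    return reduce_by_key(pairs)         # pairs -> one total per distinct word


# Hypothetical input, not taken from the slides.
lines = ["to be or not to be", "to spark"]
counts = word_count(lines)
print(counts)
```

In Spark itself, the corresponding steps would typically be a `flatMap`, a map to key/value pairs, and a `reduceByKey`, with each stage distributed across partitions instead of held in a local list.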

Scalable machine learning library

Module for running queries on structured data

Data Sources

Module to build scalable fault-tolerant streaming applications

Core Data Structures