Scala and Spark

  • Introduction to Scala and Spark

    Dr. Fabio Fumarola

  • Contents

    Hadoop quick introduction

    An introduction to Spark

    Spark Architecture & Programming Model

  • Hadoop

    An open-source software framework for distributed storage of large datasets on commodity hardware.

    Provides a programming model/framework for processing large datasets in parallel.

  • Limitations of MapReduce

    Slow due to replication, serialization, and disk I/O.

    Inefficient for:

    Iterative algorithms (machine learning, graph & network analysis)

    Interactive data mining (R, Excel, ad hoc reporting, searching)

  • Solutions?

    Leverage memory: load data into memory.

    Replace disks with SSDs.

  • Apache Spark

    A big data analytics cluster-computing framework written in Scala.

    Originally open sourced by the AMPLab at UC Berkeley.

    Provides in-memory analytics based on RDDs.

    Highly compatible with the Hadoop Storage API; can run on top of a Hadoop cluster.

    Developers can write programs using multiple programming languages.

  • Spark architecture

    (Diagram: Spark running on top of HDFS.)

  • Spark

    Not tied to the two-stage MapReduce paradigm. Instead: extract a working set, cache it, and query it repeatedly, rather than re-reading from HDFS each time (a sketch follows).
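    A minimal sketch of that pattern, assuming spark-shell already provides a SparkContext sc; the log path and the "ERROR" filter are illustrative:

    // Extract a working set, cache it, query it repeatedly.
    val errors = sc.textFile("hdfs:///logs/app.log")        // hypothetical path
      .filter(line => line.contains("ERROR"))               // extract the working set
    errors.cache()                                          // keep it in memory after the first action

    val total    = errors.count()                                // first query: reads HDFS, then caches
    val timeouts = errors.filter(_.contains("timeout")).count()  // later queries hit the cache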

  • Spark Programming Model

    (Diagram: the SparkContext.)

  • Spark Programming Model

    RDD (Resilient Distributed Dataset):

    Immutable data structure

    Explicitly in-memory

    Fault tolerant

    Parallel data structure

    Controlled partitioning to optimize data placement

    Can be manipulated using a rich set of operators
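    A small sketch of those properties, assuming a SparkContext sc is available:

    // An RDD is an immutable, partitioned, parallel collection.
    val nums = sc.parallelize(1 to 1000, numSlices = 4)  // explicitly request 4 partitions
    println(nums.partitions.length)                      // 4
    val doubled = nums.map(_ * 2)                        // transformations return a new RDD;
                                                         // nums itself is never modified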

  • RDD

    Programming interface: the programmer can perform three types of operations (see the sketch after this list).

    Transformations

    Create a new dataset from an existing one.

    Lazy in nature: they are executed only when some action is performed.

    Examples: map(func), filter(func), distinct()

    Actions

    Return a value to the driver program, or export data to a storage system, after performing a computation.

    Examples: count(), reduce(func), collect(), take()

    Persistence

    For caching datasets in memory for future operations.

    Option to store on disk, in RAM, or mixed (storage level).

    Examples: persist(), cache()
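    A minimal sketch of the three operation types, assuming a SparkContext sc and a hypothetical input file data.txt:

    val words = sc.textFile("data.txt")
      .flatMap(line => line.split(" "))   // transformation: lazy, nothing runs yet
      .distinct()                         // transformation: still lazy
    words.persist()                       // persistence: mark the RDD for in-memory caching
    val n      = words.count()            // action: triggers execution and fills the cache
    val sample = words.take(10)           // action: reuses the cached RDD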

  • How Spark works

    RDD: a parallel collection with partitions.

    The user application creates RDDs, transforms them, and runs actions.

    This results in a DAG (Directed Acyclic Graph) of operators.

    The DAG is compiled into stages.

    Each stage is executed as a series of tasks (one task per partition).
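    The DAG that Spark builds can be inspected with toDebugString, which prints the lineage of an RDD; a sketch assuming sc and a hypothetical data.txt:

    val counts = sc.textFile("data.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // shuffle dependency: this is where a stage boundary falls
    println(counts.toDebugString) // prints the chain of operators behind this RDD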

  • Example

    sc.textFile("/wiki/pagecounts")

    textFile -> RDD[String]

  • Example

    sc.textFile("/wiki/pagecounts").map(line => line.split("\t"))

    textFile -> RDD[String], map -> RDD[Array[String]]

  • Example

    sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt))

    textFile -> RDD[String], map -> RDD[Array[String]], map -> RDD[(String, Int)]

  • Example

    sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)).reduceByKey(_ + _)

    textFile -> RDD[String], map -> RDD[Array[String]], map -> RDD[(String, Int)], reduceByKey -> RDD[(String, Int)]

  • Example

    sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)).reduceByKey(_ + _, 3).collect()

    textFile -> RDD[String], map -> RDD[Array[String]], map -> RDD[(String, Int)], reduceByKey -> RDD[(String, Int)], collect -> Array[(String, Int)]
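    The same pipeline can be sketched as a self-contained standalone application rather than shell input; the object name, the local[2] master, and the assumption that each line is "<page>\t<count>" are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object PageCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PageCounts").setMaster("local[2]")
        val sc   = new SparkContext(conf)

        val counts = sc.textFile("/wiki/pagecounts")
          .map(line => line.split("\t"))
          .map(fields => (fields(0), fields(1).toInt))
          .reduceByKey(_ + _, 3)       // 3 reduce partitions, as on the slide
          .collect()                   // Array[(String, Int)] gathered on the driver

        counts.take(10).foreach(println)
        sc.stop()
      }
    }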

  • Execution Plan

    Stages are sequences of RDDs that don't have a shuffle in between.

    Stage 1: textFile, map, map. Stage 2: reduceByKey, collect.

  • Execution Plan

    Stage 1: read the HDFS split, apply both maps, start the partial reduce, write shuffle data.

    Stage 2: read the shuffle data, perform the final reduce, send the result to the driver program.

  • Stage Execution

    Create a task for each partition in the new RDD.

    Serialize the tasks.

    Schedule and ship the tasks to the slaves.

    All of this happens internally (you don't need to do anything).

  • Spark Executor (Slaves)

    (Diagram: an executor running tasks on multiple cores: Core 1, Core 2, Core 3.)

  • Summary of Components

    Task: the fundamental unit of execution in Spark

    Stage: a set of tasks that run in parallel

    DAG: the logical graph of RDD operations

    RDD: a parallel dataset with partitions

  • Start the Docker container

    From https://github.com/sequenceiq/docker-spark:

    docker run -i -t -h sandbox sequenceiq/spark:1.1.1-ubuntu /etc/bootstrap.sh bash

    Run the Spark shell using YARN or local mode:

    spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 2

  • Running the examples and the shell

    To run the examples:

    $ run-example SparkPi 10

    We can start a Spark shell via:

    spark-shell --master local[n]

    The --master option specifies the master URL for a distributed cluster.

    Example applications are also provided in Python:

    spark-submit examples/src/main/python/pi.py 10

  • Collections and External Datasets

    A collection can be parallelized using the SparkContext:

    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)

    Spark can create distributed datasets from HDFS, Cassandra, HBase, Amazon S3, etc.

    Spark supports text files, SequenceFiles, and any other Hadoop input format.

    Files can be read from a local or remote URI (hdfs://, s3n://):

    scala> val distFile = sc.textFile("data.txt")
    distFile: RDD[String] = MappedRDD@1d4cee08

    distFile.map(s => s.length).reduce((a, b) => a + b)

  • RDD operations

    Sum the lengths of the lines in a file:

    val lines = sc.textFile("data.txt")
    val lineLengths = lines.map(s => s.length)
    val totalLength = lineLengths.reduce((a, b) => a + b)

    If we want to use lineLengths again later, we can run:

    lineLengths.persist()

    This marks lineLengths to be kept in memory the first time it is computed (here, during the reduce), so later operations can reuse it.

  • Passing a function to Spark

    Spark is based on Scala's anonymous function syntax:

    (x: Int) => x * x

    which is shorthand for:

    new Function1[Int, Int] { def apply(x: Int) = x * x }

    We can define functions with more parameters, or with none:

    (x: Int, y: Int) => "(" + x + ", " + y + ")"
    () => { System.getProperty("user.dir") }

    These anonymous functions are shorthand for instances of the FunctionN traits, from Function0 up to Function22.

  • Passing a function to Spark

    We can also pass methods defined in a top-level object:

    object MyFunctions {
      def func1(s: String): String = s + s
    }

    file.map(MyFunctions.func1)

    Or methods of a class instance:

    class MyClass {
      def func1(s: String): String = { ... }
      def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
    }

  • Working with Key-Value Pairs

    We can set up RDDs of key-value pairs, which are represented as the Tuple2 type:

    val lines = sc.textFile("data.txt")
    val pairs = lines.map(s => (s, 1))
    val counts = pairs.reduceByKey((a, b) => a + b)

    We can use counts.sortByKey() to sort, and finally counts.collect() to bring the results back to the driver.

    NOTE: when using custom objects as keys, be sure they implement equals() together with a matching hashCode(). See http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()

  • Transformations

    There are several transformations supported by Spark: map, filter, flatMap, mapPartitions, ...

    See http://spark.apache.org/docs/latest/programming-guide.html for the full list.

    When are they executed? (See the sketch below.)
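    They run only when an action needs their result; a sketch of that laziness, assuming sc and a hypothetical data.txt:

    val upper = sc.textFile("data.txt").map(_.toUpperCase)  // nothing is read or computed yet
    val first = upper.first()                               // the action triggers reading and mapping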

  • Actions

    Some of the commonly used actions: reduce, collect, count, first, take, takeSample.
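    A few of the listed actions in use, as a sketch assuming a SparkContext sc:

    val nums = sc.parallelize(1 to 100)
    val sum       = nums.reduce(_ + _)   // 5050
    val howMany   = nums.count()         // 100
    val head      = nums.first()         // 1
    val firstFive = nums.take(5)         // Array(1, 2, 3, 4, 5)
    val sample    = nums.takeSample(withReplacement = false, num = 3)  // 3 random elements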

  • RDD Persistence

    One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations.

    Caching is a key tool for iterative algorithms and fast interactive use.

    You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes.

    Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

  • RDD persistence

    In addition, each persisted RDD can be stored using a different storage level: for example, we can persist the dataset on disk, in memory but as serialized Java objects (to save space), replicated across nodes, or off-heap in Tachyon.

    Note: in Python, stored objects are always serialized with the Pickle library, so it does not matter whether you choose a serialized level.

    Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without the user calling persist.

  • Which Storage Level to Choose?

    Use memory only (MEMORY_ONLY) if the dataset fits in main memory.

    If not, try MEMORY_ONLY_SER and select a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.

    Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data; otherwise, recomputing a partition may be as fast as reading it from disk.

    Use the replicated storage levels if you want fast fault recovery.

    Use OFF_HEAP in environments with high amounts of memory or multiple applications.
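    A sketch of picking an explicit storage level, assuming sc and a hypothetical data.txt; cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("data.txt")
    lines.persist(StorageLevel.MEMORY_ONLY_SER)    // serialized in memory: more compact, slightly slower to read
    // lines.persist(StorageLevel.MEMORY_AND_DISK) // alternative: spill to disk instead of recomputing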

  • Shared Variables

    Normally, when a function is executed on a remote node it works on separate copies of the variables it uses, and updates are not propagated back to the driver.

    However, Spark does provide two types of shared variables for two common usages: broadcast variables and accumulators.

  • Broadcast VariablesBroadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

    scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
    broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

    scala> broadcastVar.value
    res0: Array[Int] = Array(1, 2, 3)

  • Accumulators

    Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel.

    Spark natively supports accumulators of numeric types, and programmers can add support for new types.

    Note: not yet supported in Python.

    scala> val accum = sc.accumulator(0, "My Accumulator")
    accum: spark.Accumulator[Int] = 0

    scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

    scala> accum.value
    res7: Int = 10

  • Accumulators

    object VectorAccumulatorParam extends Accu