Apache Spark 101

  • Apache Spark 101, June 2016

    Abdullah Cetin CAVDAR

    @accavdar

    #AnkaraSparkDay

    http://linkedin.com/in/accavdar

    http://twitter.com/accavdar

  • Apache Spark's Goal

  • Apache Spark is a fast and general engine for large-scale data processing

  • Most Active Project in Big Data

    Spark Survey 2015

    https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html

  • Top 10 Industries Using Spark

  • Many Types of Products

  • Spark Engine

    unified engine across diverse workloads & environments

  • Programming Languages

  • Open Source Spark Ecosystem

  • Most Important Aspects

  • Spark Programming Model

  • Challenge? Fast data sharing across parallel jobs

  • Data Sharing in MapReduce

  • Data Sharing in Apache Spark

  • Components

  • Cluster Managers

  • Initializing Apache Spark

    SparkConf and SparkContext

  • Apache Spark Shell

    Python and Scala

  • RDD (Resilient Distributed Dataset)

    An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost

  • RDD: Read-Only = Immutable

    Parallelism

    Caching

  • RDD: Partitioned = Distributed

    More partitions = More parallelism

  • RDD: Rebuilt = Resilient

    Recover lost data partitions by replaying data lineage
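The recovery idea above can be sketched in plain Python. This is a toy model, not Spark's API: `source_partitions`, `lineage`, and `compute_partition` are hypothetical names, and the "lost" partition is simply removed from a dictionary.

```python
from typing import Callable, Dict, List

# Source data split into partitions (stands in for blocks of an input file).
source_partitions: List[List[int]] = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Lineage: the ordered per-partition steps that derive the result from the source.
lineage: List[Callable[[List[int]], List[int]]] = [
    lambda part: [x * 10 for x in part],            # plays the role of a map step
    lambda part: [x for x in part if x % 20 == 0],  # plays the role of a filter step
]


def compute_partition(index: int) -> List[int]:
    """Rebuild one partition by replaying the lineage on its source data."""
    data = source_partitions[index]
    for step in lineage:
        data = step(data)
    return data


# Materialize every partition once.
computed: Dict[int, List[int]] = {
    i: compute_partition(i) for i in range(len(source_partitions))
}

# Simulate losing partition 1 (for example, the machine holding it failed).
lost = computed.pop(1)

# Only the lost partition is recomputed; the other partitions are untouched.
computed[1] = compute_partition(1)
recovered = computed[1]
```

Because the data is immutable and the steps are deterministic, replaying the same lineage yields the same partition, so no replica of the derived data is needed in this model.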

  • RDD Operations

  • Partitions

    logical division of data / basic unit of parallelism
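A rough plain-Python illustration of partitions as the unit of parallelism. It is a sketch under assumptions: threads stand in for executors, and `split_into_partitions` and `sum_of_squares` are hypothetical helpers, not Spark functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split data into num_partitions contiguous slices of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition: List[int]) -> int:
    """Work done on one partition; it never looks at other partitions."""
    return sum(x * x for x in partition)


data = list(range(1, 11))
partitions = split_into_partitions(data, 4)

# One task per partition: the number of partitions caps how many tasks can run at once.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_results = list(pool.map(sum_of_squares, partitions))

# Combine the per-partition results into the final answer.
total = sum(partial_results)
```

With a single partition only one task could run, however many workers were available; with four partitions, up to four tasks can proceed independently.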

  • RDD Lineage

    Lazy Evaluation

  • DAG (Directed Acyclic Graph)

  • Transformation & Action
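Lazy evaluation, lineage, and the transformation/action split can be modeled in a few lines of plain Python. `ToyRDD` below is a hypothetical class written for illustration; it borrows the method names `map`, `filter`, and `collect` but is not Spark's implementation.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: records transformations, computes only on an action."""

    def __init__(self, source: Optional[Iterable[Any]] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[Iterator[Any]], Iterator[Any]]] = None,
                 name: str = "source") -> None:
        self._source = source
        self._parent = parent
        self._op = op
        self.name = name

    # Transformations: return a new ToyRDD that remembers its parent; no work is done.
    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it), name="map")

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)), name="filter")

    def lineage(self) -> List[str]:
        """Walk parent links back to the source and list the steps in order."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node._parent
        return list(reversed(names))

    def _iterate(self) -> Iterator[Any]:
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._iterate())

    # Action: this is the only place where the recorded steps actually run.
    def collect(self) -> List[Any]:
        return list(self._iterate())


calls: List[int] = []


def traced_double(x: int) -> int:
    calls.append(x)  # record that the function really executed
    return x * 2


rdd = ToyRDD([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has been computed yet
result = rdd.collect()            # the action triggers the whole chain
```

The chain of parent links is a linear special case of a DAG; joins or unions would give a node more than one parent.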

  • RDD Creation

    Parallelizing a collection held in driver application memory (for prototyping and testing only)

    Loading an external data set: file://, hdfs://, s3n://

    sc.textFile(), sc.hadoopFile(), sc.newAPIHadoopFile(), sqlContext.read()

  • Word Count :)
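The slide's code is not part of this transcript. As a stand-in, here is the usual word-count shape (split lines into words, pair each word with 1, sum per key) in plain Python. The sample `lines` and the helper `reduce_by_key` are assumptions for illustration, named after the Spark operations they imitate.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical input; in a Spark job this would come from an external data set.
lines: List[str] = ["to be or not to be", "that is the question"]

# flatMap step: each line yields several words, flattened into one sequence.
words = list(chain.from_iterable(line.split() for line in lines))

# map step: pair every word with the count 1.
pairs: List[Tuple[str, int]] = [(word, 1) for word in words]


def reduce_by_key(items: Iterable[Tuple[str, int]],
                  combine: Callable[[int, int], int]) -> Dict[str, int]:
    """Group values by key, then fold each group with the combine function."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in items:
        grouped[key].append(value)
    return {key: reduce(combine, values) for key, values in grouped.items()}


# reduceByKey step: sum the 1s for each distinct word.
counts = reduce_by_key(pairs, lambda a, b: a + b)
```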

  • Driver & Workers

    Main program is executed on the Driver

    Transformations are executed on Workers

    Actions transfer results from Workers to the Driver

    The Driver cannot get data from executors except through actions and accumulators

  • RDD Dependencies

    Minimize shuffle / Wide Dependencies

  • RDD Persistence / Caching

    persist() or cache()

    Without cache, each action restarts from the first RDD

    LRU (Least Recently Used)

    Default Storage Level: MEMORY_ONLY
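A small plain-Python model of why caching matters when several actions reuse the same data. `ToyDataset` and `expensive_parse` are hypothetical; LRU eviction and storage levels are deliberately not modeled.

```python
from typing import Callable, List, Optional

compute_calls = 0  # counts how often the expensive step really runs


def expensive_parse(x: int) -> int:
    global compute_calls
    compute_calls += 1
    return x * x


class ToyDataset:
    """Recomputes from the source on every action unless cache() was requested."""

    def __init__(self, source: List[int], fn: Callable[[int], int]) -> None:
        self._source = source
        self._fn = fn
        self._use_cache = False
        self._cached: Optional[List[int]] = None

    def cache(self) -> "ToyDataset":
        self._use_cache = True  # only a marker; nothing is computed here
        return self

    def _compute(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        result = [self._fn(x) for x in self._source]
        if self._use_cache:
            self._cached = result  # keep the result for later actions
        return result

    # Two actions that both need the computed data.
    def count(self) -> int:
        return len(self._compute())

    def total(self) -> int:
        return sum(self._compute())


uncached = ToyDataset([1, 2, 3], expensive_parse)
uncached.count()
uncached.total()
calls_without_cache = compute_calls  # both actions recomputed all three elements

compute_calls = 0
cached = ToyDataset([1, 2, 3], expensive_parse).cache()
cached.count()
cached_total = cached.total()
calls_with_cache = compute_calls  # the second action reused the stored result
```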

  • Storage Levels

  • Shared Variables

    Accumulators and Broadcast Variables

  • Accumulators

    Used to implement counters or sums

  • Broadcast Variables

    Keep a read-only variable cached on each machine
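The two kinds of shared variable can be imitated in plain Python to show their roles: a read-only lookup table that every task reads, and a counter that tasks only add to. This is a single-process sketch; `ToyAccumulator`, `broadcast_lookup`, and `resolve` are hypothetical names, and the country codes are made-up sample data.

```python
from types import MappingProxyType
from typing import List


class ToyAccumulator:
    """Add-only counter: tasks call add(), only the driver reads value."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, amount: int) -> None:
        self.value += amount


# Broadcast-style variable: a read-only view of a lookup table shared by all tasks.
broadcast_lookup = MappingProxyType({"TR": "Turkey", "US": "United States"})

# Accumulator-style variable: counts codes that could not be resolved.
unknown_codes = ToyAccumulator()


def resolve(partition: List[str]) -> List[str]:
    """Task body: translate codes using the shared table, count the misses."""
    resolved_names = []
    for code in partition:
        if code in broadcast_lookup:
            resolved_names.append(broadcast_lookup[code])
        else:
            unknown_codes.add(1)
    return resolved_names


partitions = [["TR", "US", "XX"], ["US", "YY"]]
resolved = [resolve(partition) for partition in partitions]
```

`MappingProxyType` rejects item assignment, which mirrors the read-only nature of a broadcast value in this toy setting.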

  • Spark UI

    Default port 4040

  • Deploying to a Cluster

    Use spark-submit

  • DataFrames & Performance

    Distributed collection of rows organized into named columns

  • Tips

    Avoid groupByKey and wide dependencies

    Use enough partitions

    Use coalesce to avoid producing too many small files

    Be cautious about serialization/deserialization
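The first tip can be made concrete by counting how many records would cross a shuffle under each strategy. This is a plain-Python sketch with made-up data; `shuffle_group_by_key` and `shuffle_reduce_by_key` are hypothetical functions that imitate the behaviour of grouping first versus combining locally first.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, int]

# Two map-side partitions of (key, value) pairs.
partitions: List[List[Pair]] = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1), ("c", 1)],
]


def shuffle_group_by_key(parts: List[List[Pair]]) -> Tuple[Dict[str, int], int]:
    """Every record crosses the shuffle; values are summed only afterwards."""
    shuffled = [pair for part in parts for pair in part]
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(parts: List[List[Pair]]) -> Tuple[Dict[str, int], int]:
    """Each partition pre-aggregates locally, so it ships one record per key."""
    shuffled: List[Pair] = []
    for part in parts:
        local: Counter = Counter()
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals: Counter = Counter()
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_totals, grouped_shuffled = shuffle_group_by_key(partitions)
reduced_totals, reduced_shuffled = shuffle_reduce_by_key(partitions)
```

Both strategies give the same totals, but the locally combined version moves fewer records, and the gap grows with the number of repeated keys per partition.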

  • Major Features in 2.0

  • Thank you

  • #AnkaraSparkDay