Apache Spark Introduction

Spark Conf Taiwan 2016

Apache Spark
Rich Lee
2016/9/21

Agenda
Spark overview
Spark core: RDD
Spark application development: Spark-Shell, Zeppelin, Application

Spark Overview
Apache Spark is a fast and general-purpose cluster computing system.

Key features:
Fast
Ease of use
General-purpose
Scalable
Fault tolerant

(Chart: Logistic regression in Hadoop and Spark)

Spark Overview: Cluster Modes
Local
Standalone
Hadoop YARN
Apache Mesos
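Each cluster mode corresponds to a master URL that the driver passes to Spark. A minimal sketch (host names, ports, and the app name below are placeholders, not from the slides):

  import org.apache.spark.{SparkConf, SparkContext}

  // Pick the master URL that matches the cluster mode:
  //   local[4]                   - run locally with 4 worker threads
  //   spark://master-host:7077   - Spark standalone cluster
  //   yarn-client / yarn-cluster - Hadoop YARN (just "yarn" in newer Spark versions)
  //   mesos://mesos-host:5050    - Apache Mesos
  val conf = new SparkConf()
    .setAppName("cluster-mode-sketch")
    .setMaster("local[4]")
  val sc = new SparkContext(conf)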

Spark Overview: High-Level Architecture
Driver Program
Cluster Manager
Worker Node
Executor
Task

Spark Overview: Install and Startup
Download: http://spark.apache.org/downloads.html
Start master and workers:
  ./sbin/start-all.sh
  Web UI: http://localhost:8080
Start the history server:
  ./sbin/start-history-server.sh hdfs://localhost:9000/spark/directory
  Web UI: http://localhost:18080
Start spark-shell:
  ./bin/spark-shell --master "spark://RichdeMacBook-Pro.local:7077"
  ./bin/spark-shell --master local[4]

RDD: Resilient Distributed Dataset
An RDD represents a collection of partitioned data elements that can be operated on in parallel. It is the primary data abstraction mechanism in Spark.
Partitioned
Fault tolerant
Interface
In memory

RDD: Create RDD
parallelize
  val xs = (1 to 10000).toList
  val rdd = sc.parallelize(xs)

textFile
  val lines = sc.textFile("/input/README.md")
  val lines = sc.textFile("file:///RICH_HD/BigData_Tools/spark-1.6.2/README.md")
Other sources: HDFS ("hdfs://"), Amazon S3 ("s3n://"), Cassandra, HBase
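Both creation methods accept an optional partition hint; a brief sketch (the counts 8 and 4 are only illustrative):

  val rdd   = sc.parallelize(1 to 10000, 8)        // ask for 8 partitions
  val lines = sc.textFile("/input/README.md", 4)   // at least 4 partitions
  rdd.partitions.length                            // => 8; partitions are processed in parallel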

RDD: Transformations
A transformation creates a new RDD by performing a computation on the source RDD.
map
  val lines = sc.textFile("/input/README.md")
  val lengths = lines map { l => l.length }

flatMap
  val words = lines flatMap { l => l.split(" ") }

filter
  val longLines = lines filter { l => l.length > 80 }

RDD: Actions
An action returns a value to the driver program.
first
  val numbersRdd = sc.parallelize(List(10, 5, 3, 1))
  val firstElement = numbersRdd.first
max
  numbersRdd.max
reduce
  val sum = numbersRdd.reduce((x, y) => x + y)
  val product = numbersRdd.reduce((x, y) => x * y)

RDD: Filter log example
  val logs = sc.textFile("path/to/log-files")
  val errorLogs = logs filter { l => l.contains("ERROR") }
  val warningLogs = logs filter { l => l.contains("WARN") }
  val errorCount = errorLogs.count
  val warningCount = warningLogs.count

(Diagram: the log RDD is filtered into an error RDD and a warn RDD, and count is applied to each)

RDD: Caching
Caching stores an RDD in memory or on storage. When an application caches an RDD in memory, Spark stores it in the executor memory on each worker node; each executor keeps in memory the RDD partitions that it computes.

cache
persist, with a storage level:
  MEMORY_ONLY
  DISK_ONLY
  MEMORY_AND_DISK
  MEMORY_ONLY_SER
  MEMORY_AND_DISK_SER
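A minimal persist sketch (the log path is a placeholder, as on the other slides): persist lets you pick a storage level explicitly, whereas cache on an RDD always uses MEMORY_ONLY.

  import org.apache.spark.storage.StorageLevel

  val logs = sc.textFile("path/to/log-files")
  logs.persist(StorageLevel.MEMORY_AND_DISK)   // keep partitions in memory, spill to disk if they do not fit
  logs.count()                                 // first action computes and caches the partitions
  logs.count()                                 // later actions reuse the cached partitions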

RDD: Cache example
  val logs = sc.textFile("path/to/log-files")
  val errorsAndWarnings = logs filter { l => l.contains("ERROR") || l.contains("WARN") }
  errorsAndWarnings.cache()
  val errorLogs = errorsAndWarnings filter { l => l.contains("ERROR") }
  val warningLogs = errorsAndWarnings filter { l => l.contains("WARN") }
  val errorCount = errorLogs.count
  val warningCount = warningLogs.count

Spark Application Development
Spark-Shell
Zeppelin
Application (Java/Scala), submitted with spark-submit

Spark Application Development: WordCount
  val textFile = sc.textFile("/input/README.md")
  val wcData = textFile.flatMap(line => line.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
  wcData.collect().foreach(println)
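The same WordCount can also run as a standalone application submitted with spark-submit. The sketch below assumes a Scala project with the spark-core dependency; the object name, output path, jar name, and master URL are placeholder choices, not from the slides.

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      // master URL is supplied by spark-submit, so only the app name is set here
      val conf = new SparkConf().setAppName("WordCount")
      val sc = new SparkContext(conf)

      val wcData = sc.textFile("/input/README.md")
        .flatMap(line => line.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)

      wcData.saveAsTextFile("/output/wordcount")
      sc.stop()
    }
  }

Packaged as a jar and submitted, for example:
  ./bin/spark-submit --class WordCount --master "spark://RichdeMacBook-Pro.local:7077" wordcount.jar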

Taiwan Hadoop User Group: https://www.facebook.com/groups/hadoop.tw/
Taiwan Spark User Group: https://www.facebook.com/groups/spark.tw/