Apache Spark Introduction

Spark Conf Taiwan 2016

Apache Spark
Rich Lee
2016/9/21

Agenda
Spark Overview
Spark Core: RDD
Spark Application Development: Spark Shell, Zeppelin, Application



Spark Overview
Apache Spark is a fast and general-purpose cluster computing system.
Key features: Fast, Ease of Use, General-purpose, Scalable, Fault Tolerant
[Figure: Logistic regression in Hadoop and Spark]


Spark Overview: Cluster Modes (selected via a master URL, sketched after this list)
Local
Standalone
Hadoop YARN
Apache Mesos
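Each mode is chosen with the --master URL passed to spark-shell or spark-submit. A minimal sketch (host names, ports, and app.jar are placeholders):

./bin/spark-shell --master local[4]
./bin/spark-shell --master spark://master-host:7077
./bin/spark-submit --master yarn --deploy-mode cluster app.jar
./bin/spark-submit --master mesos://mesos-host:5050 app.jar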


Spark Overview: High-Level Architecture
Driver Program
Cluster Manager
Worker Node
Executor
Task
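As a minimal sketch of how these pieces meet in code: the driver program builds a SparkConf and a SparkContext, and the context asks the cluster manager for executors on the worker nodes (the app name and master URL below are placeholders).

import org.apache.spark.{SparkConf, SparkContext}

// Driver program: configures the application and connects to a cluster manager
val conf = new SparkConf()
  .setAppName("my-spark-app")            // placeholder application name
  .setMaster("spark://master-host:7077") // placeholder standalone master URL
val sc = new SparkContext(conf)          // schedules tasks on executors via the cluster manager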


Spark Overview: Install and Startup
Download
http://spark.apache.org/downloads.html
Start Master and Worker
./sbin/start-all.sh
http://localhost:8080
Start History server
./sbin/start-history-server.sh hdfs://localhost:9000/spark/directory
http://localhost:18080
Start Spark-Shell
./bin/spark-shell --master "spark://RichdeMacBook-Pro.local:7077"
./bin/spark-shell --master local[4]

RDD: Resilient Distributed Dataset
An RDD represents a collection of partitioned data elements that can be operated on in parallel. It is the primary data abstraction mechanism in Spark.
Properties: Partitioned, Fault Tolerant, Interface, In Memory
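The partitioning is easy to inspect from spark-shell; a quick sketch (the default partition count depends on the cluster configuration):

val rdd = sc.parallelize(1 to 10000)
rdd.partitions.length                     // number of partitions Spark chose by default
val rdd8 = sc.parallelize(1 to 10000, 8)  // explicitly request 8 partitions
rdd8.partitions.length                    // 8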

RDD: Create RDD
parallelize
val xs = (1 to 10000).toList
val rdd = sc.parallelize(xs)
textFile
val lines = sc.textFile("/input/README.md")
val lines = sc.textFile("file:///RICH_HD/BigData_Tools/spark-1.6.2/README.md")
HDFS - "hdfs://"
Amazon S3 - "s3n://"
Cassandra, HBase


RDD: Transformations
A transformation creates a new RDD by performing a computation on the source RDD.
map
val lines = sc.textFile("/input/README.md")
val lengths = lines map { l => l.length }
flatMap
val words = lines flatMap { l => l.split(" ") }
filter
val longLines = lines filter { l => l.length > 80 }
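Note that transformations are lazy: Spark only records the lineage, and nothing is computed until an action runs. A small sketch:

val lines = sc.textFile("/input/README.md")
val lengths = lines map { l => l.length } // nothing executes yet; only the lineage is recorded
val total = lengths.reduce(_ + _)         // the action triggers the actual computation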

RDD: Actions
An action returns a value to the driver program.
first
val numbersRdd = sc.parallelize(List(10, 5, 3, 1))
val firstElement = numbersRdd.first
max
numbersRdd.max
reduce
val sum = numbersRdd.reduce((x, y) => x + y)
val product = numbersRdd.reduce((x, y) => x * y)
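Two more common actions, as a quick sketch: collect copies the whole RDD back to the driver (safe only for small results), while take fetches just the first n elements.

numbersRdd.collect() // Array(10, 5, 3, 1) - all elements, copied to the driver
numbersRdd.take(2)   // Array(10, 5)       - only the first two elements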

RDD: Filter log example
val logs = sc.textFile("path/to/log-files")
val errorLogs = logs filter { l => l.contains("ERROR") }
val warningLogs = logs filter { l => l.contains("WARN") }
val errorCount = errorLogs.count
val warningCount = warningLogs.count
[Diagram: the log RDD branches into an error RDD and a warn RDD, each feeding a count action]

RDD: Caching
Caching stores an RDD in memory or on disk. When an application caches an RDD in memory, Spark stores it in the executor memory on each worker node; each executor keeps in memory the RDD partitions that it computes.
cache
persist: MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER
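A sketch of the two calls: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while persist accepts an explicit storage level (the file paths below are placeholders).

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("path/to/log-files")
logs.cache()                                 // same as logs.persist(StorageLevel.MEMORY_ONLY)

val events = sc.textFile("path/to/events")   // placeholder path
events.persist(StorageLevel.MEMORY_AND_DISK) // spills partitions to disk when memory is full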

RDD: Cache example
val logs = sc.textFile("path/to/log-files")
val errorsAndWarnings = logs filter { l => l.contains("ERROR") || l.contains("WARN") }
errorsAndWarnings.cache()
val errorLogs = errorsAndWarnings filter { l => l.contains("ERROR") }
val warningLogs = errorsAndWarnings filter { l => l.contains("WARN") }
val errorCount = errorLogs.count
val warningCount = warningLogs.count


Spark Application Development
Spark-Shell
Zeppelin
Application (Java/Scala): spark-submit
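A typical submit command, sketched with placeholder class name, jar path, and master URL:

./bin/spark-submit \
  --class com.example.WordCount \
  --master spark://master-host:7077 \
  path/to/wordcount.jar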

Spark Application Development: WordCount
val textFile = sc.textFile("/input/README.md")
val wcData = textFile.flatMap(line => line.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
wcData.collect().foreach(println)
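To keep the result instead of printing it, the counts could also be written back out (the output path is a placeholder):

wcData.saveAsTextFile("/output/wordcount") // writes one part file per partition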

Shameless plug: Taiwan Hadoop User Group
https://www.facebook.com/groups/hadoop.tw/
Taiwan Spark User Group
https://www.facebook.com/groups/spark.tw/