Intro to Apache Spark
Transcript of Intro to Apache Spark
1. Intro to Apache Spark
    Clustered In-Memory Computation
    Marius Soutier, Freelance Software Engineer, @mariussoutier

2. Motivation
    Classical data architectures break down:
    - RDBMS can't handle large amounts of data well
    - Most RDBMS can't handle multiple input formats
    - Most NoSQLs don't offer analytics
    Problem: running computations on Big Data

3. The 3 Vs of Big Data
    - Volume: 100s of GB, TB, PB
    - Variety: structured, unstructured, semi-structured
    - Velocity: sensors, real-time, "Fast Data"

4. Hadoop (1)
    - Hadoop is the de-facto standard for running computations on large amounts of different data
    - Hadoop consists of:
        - HDFS: distributed, fault-tolerant file system
        - Map/Reduce: parallelizable computations, pioneered by Google
    - Hadoop is typically run on a (large) cluster of non-virtualized commodity hardware

5. Hadoop (2)
    - However, Map/Reduce jobs are batch jobs with high latency
    - Not suitable for interactive queries, real-time analytics, or Machine Learning
    - Pure Map/Reduce is hard to develop and maintain

6. Enter Spark
    Spark is a framework for clustered in-memory data processing

7. Apache Spark (1)
    - Developed at UC Berkeley, released in 2010
    - Apache Top-Level Project since February 2014; current version is 1.2.1 / 1.3.0
    - USP: uses cluster-wide available memory to speed up computations
    - Very active community

8. Apache Spark (2)
    - Written in Scala (& Akka), with APIs for Java and Python
    - Programming model is a collection pipeline* instead of Map/Reduce
    - Supports batch, streaming, interactive, or all combined using a unified API
    * http://martinfowler.com/articles/collection-pipeline/

9. Spark Ecosystem
    - Spark Core
    - Spark SQL
    - Spark Hive
    - BlinkDB (approximate SQL)
    - Spark Streaming
    - MLlib (machine learning)
    - GraphX
    - SparkR
    - Tachyon
    Three components carry an ALPHA label on the slide.

10. Spark is a framework for clustered in-memory data processing
    Spark is a platform for data-driven products.

11. RDD
    - Base abstraction: Resilient Distributed Dataset (RDD)
    - Essentially a distributed collection of objects
    - Can be cached in memory or on disk

12. RDD Word Count

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair RDD operations such as reduceByKey (needed before Spark 1.3)
    import org.apache.spark.rdd.RDD

    // Master and app name are expected to come from spark-submit
    val sc = new SparkContext()
    val input: RDD[String] = sc.textFile("/tmp/word.txt")
    // Split each line into lowercase words and pair each word with a count of 1
    val words: RDD[(String, Long)] = input
      .flatMap(line => line.toLowerCase.split("\\s+"))
      .map(word => word -> 1L)
      .cache()
    // Sum the counts per word, then sort alphabetically by word
    val wordCountsRdd: RDD[(String, Long)] = words
      .reduceByKey(_ + _)
      .sortByKey()
    // Bring the results back to the driver
    val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()

13. Cluster
    Diagram labels: Driver, SparkContext, Master, Worker, Executor, Tasks
    - The Spark app (driver) builds a DAG from RDD operations
    - The DAG is split into tasks that are executed by workers

14. Example Architecture
    Diagram labels: Input, HDFS, Message Queue, Spark Streaming, Spark Batch Jobs, SparkSQL, Real-Time Dashboard, Interactive SQL, Analytics, Reports

15. Demo
    Questions?
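
The "collection pipeline" programming model from slide 8 can be tried without a cluster. The following is a minimal sketch, not part of the original deck, that runs the same stages as the RDD word count on plain Scala collections. The object name `CollectionPipelineWordCount`, the helper `wordCounts`, and the sample sentences are made up for illustration. `groupBy` followed by a sum stands in for Spark's `reduceByKey`, and the empty-string filter is an addition that the slide's version does not have.

```scala
object CollectionPipelineWordCount {

  // Hypothetical helper: same pipeline stages as the Spark example, on local collections.
  def wordCounts(lines: Seq[String]): Seq[(String, Long)] =
    lines
      .flatMap(line => line.toLowerCase.split("\\s+").toSeq) // split each line into words
      .filter(_.nonEmpty)                                    // assumption: drop empty tokens from leading whitespace
      .map(word => word -> 1L)                               // pair each word with a count of 1
      .groupBy { case (word, _) => word }                    // local stand-in for reduceByKey
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
      .toSeq
      .sortBy { case (word, _) => word }                     // local stand-in for sortByKey

  def main(args: Array[String]): Unit = {
    // Made-up sample input
    val sample = Seq("Spark is fast", "spark is clustered")
    wordCounts(sample).foreach { case (word, count) => println(s"$word: $count") }
  }
}
```

The shape of the code is nearly the same as the RDD version. The main difference is that the local collection is evaluated eagerly on one machine, whereas Spark only records the RDD operations and runs them on the cluster once an action such as `collect()` is called.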