Apache Spark & Hadoop

Description

Hadoop has been a huge success in the data world. It has disrupted decades of data-management practices and technologies by introducing a massively parallel processing framework, and the community's development of all the open-source components pushed Hadoop to where it is now. That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming-data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk introduces the Spark stack, explains how Spark achieves its lightning-fast results, and shows how it complements Apache Hadoop.

Keys Botzum is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large-scale distributed system design. Previously he was a Senior Technical Staff Member with IBM and a respected author of many articles on the WebSphere Application Server, as well as a book.

Transcript of Apache Spark & Hadoop

  • Slide 1: Apache Spark. Keys Botzum, Senior Principal Technologist, MapR Technologies. June 2014.
  • Slide 2: Agenda. MapReduce; Apache Spark; How Spark Works; Fault Tolerance and Performance; Examples; Spark and More.
  • Slide 3: MapR: Best Product, Best Business & Best Customers. Top-ranked product; exponential growth, with 3X bookings from Q1 2013 to Q1 2014; 500+ customers, including cloud leaders; 80% of accounts expand 3X; 90% software licenses; $1B in incremental revenue generated by one customer.
  • Slide 4: Review: MapReduce.
  • Slide 5: MapReduce: A Programming Model. "MapReduce: Simplified Data Processing on Large Clusters" (published 2004) describes a parallel and distributed algorithm offering data locality, fault tolerance, and linear scalability.
  • Slide 6: MapReduce Basics. Assumes a scalable distributed file system that shards data. Map: loads the data and defines a set of keys. Reduce: collects the organized key-based data to process and output. Performance can be tuned based on known details of your source files and cluster shape (size, total number).
  • Slide 7: MapReduce Processing Model. Define mappers; shuffling is automatic; define reducers; for complex work, chain jobs together.
  • Slide 8: MapReduce: The Good. Built-in fault tolerance; optimized I/O path; scalable; the developer focuses on Map/Reduce, not infrastructure; a simple(?) API.
  • Slide 9: MapReduce: The Bad. Optimized for disk I/O; doesn't leverage memory well; iterative algorithms go through the disk I/O path again and again. The API is primitive: developers have to build on a very simple abstraction (key/value in, key/value out), even basic things like a join require extensive code (see the contrast sketched below), and the result is often many files that need to be combined appropriately.
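
    For contrast, here is a minimal sketch, not from the deck, of how the join the slide mentions looks in Spark's RDD API; the datasets and field meanings are hypothetical, and sc is a SparkContext as used later in the deck:

        // Hypothetical keyed datasets: (userId, name) and (userId, purchase amount).
        val users     = sc.parallelize(Seq((1, "alice"), (2, "bob")))
        val purchases = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 12.00)))

        // Classic MapReduce needs custom mapper/reducer code plus a tagging
        // scheme to distinguish the two inputs; in Spark the join is one
        // transformation on keyed RDDs.
        val joined = users.join(purchases)   // RDD[(Int, (String, Double))]
        joined.collect().foreach(println)
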
  • Slide 10: Apache Spark.
  • Slide 11: Apache Spark. spark.apache.org, github.com/apache/spark, user@spark.apache.org. Originally developed in 2009 in UC Berkeley's AMPLab; fully open-sourced in 2010; now at the Apache Software Foundation, with commercial vendors developing and supporting it.
  • Slide 12: Spark: Easy and Fast Big Data. Easy to develop: rich APIs in Java, Scala, and Python, plus an interactive shell. Fast to run: general execution graphs and in-memory storage, with 2-5x less code.
  • Slide 13: Resilient Distributed Datasets (RDD). Spark revolves around RDDs: fault-tolerant, read-only collections of elements that can be operated on in parallel, cached in memory or on disk (see the sketch below). Reference: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
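
    A minimal sketch, not from the deck, of creating and caching an RDD; the HDFS path is hypothetical and sc is the usual SparkContext:

        // Build an RDD from a text file; nothing is read yet, since RDDs are lazy.
        val logs = sc.textFile("hdfs:///data/logs")

        // Ask Spark to keep the partitions in memory after they are first computed.
        logs.cache()

        // The first action materializes the RDD and populates the cache...
        logs.count()
        // ...so later actions are served from memory instead of disk.
        logs.count()
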
  • Slide 14: RDD Operations: Expressive. Transformations create a new RDD from an existing one: map, filter, distinct, union, sample, groupByKey, join, etc. Actions return a value after running a computation: reduce, collect, count, first, takeSample, foreach, etc. (A short sketch follows.) Check the documentation for a complete list: http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
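
    A short sketch, not from the deck, showing the split between lazy transformations and actions, on made-up data:

        val nums = sc.parallelize(1 to 10)

        // Transformations are lazy: they only describe new RDDs.
        val evens   = nums.filter(_ % 2 == 0)
        val doubled = evens.map(_ * 2)

        // Actions trigger the computation and return results to the driver.
        doubled.count()     // 5
        doubled.collect()   // Array(4, 8, 12, 16, 20)
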
  • Slide 15: Easy: Clean API. Resilient Distributed Datasets: collections of objects spread across a cluster, stored in RAM or on disk, built through parallel transformations, and automatically rebuilt on failure (see the lineage sketch below). Operations: transformations (e.g. map, filter, groupBy) and actions (e.g. count, collect, save). Write programs in terms of transformations on distributed datasets.
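
    The "automatically rebuilt on failure" point rests on lineage: each RDD remembers the transformations that produced it, so a lost partition can be recomputed from its parents. A small sketch, not from the deck, that inspects this lineage on made-up data:

        val words  = sc.parallelize(Seq("spark", "hadoop", "spark"))
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // Print the chain of parent RDDs Spark would replay to rebuild a lost partition.
        println(counts.toDebugString)
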
  • Slide 16: Easy: Expressive API. map, reduce.
  • Slide 17: Easy: Expressive API. map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ... (a sketch using a few of these follows).
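
    A brief sketch, not from the deck, chaining a few of the listed operators on hypothetical keyed data (output ordering may vary across runs):

        val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
        val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))

        // leftOuterJoin keeps every key from the left side; missing right-side
        // values come back as None.
        left.leftOuterJoin(right).collect()
        // e.g. Array((a,(1,Some(x))), (b,(2,None)), (c,(3,Some(y))))

        // cogroup gathers the values from both RDDs per key.
        left.cogroup(right).collect()
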
  • Slide 18: Easy: Example: Word Count. In Hadoop MapReduce:

        public static class WordCountMapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              output.collect(word, one);
            }
          }
        }

        public static class WordCountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }

    The same job in Spark:

        val spark = new SparkContext(master, appName, [sparkHome], [jars])
        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
  • Slide 20: Easy: Works Well With Hadoop. Data compatibility: access your existing Hadoop data, use the same data formats, and adhere to data locality for efficient processing (see the sketch below). Deployment models: standalone, YARN-based, or Mesos-based; deploy on an existing Hadoop cluster or side-by-side.
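
    A minimal sketch, not from the deck, of the data-compatibility point: the same SparkContext reads HDFS paths and standard Hadoop formats directly; both paths are hypothetical:

        // Plain text stored in HDFS; Spark schedules tasks for data locality.
        val lines = sc.textFile("hdfs:///existing/hadoop/textdata")

        // Hadoop SequenceFiles are readable through the same context.
        import org.apache.hadoop.io.{IntWritable, Text}
        val pairs = sc.sequenceFile("hdfs:///existing/hadoop/seqfile",
                                    classOf[Text], classOf[IntWritable])
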
  • Slide 21: Easy: User-Driven Roadmap. Language support: improved Python support, SparkR, Java 8. Integrated schema and SQL support in Spark's APIs. Better ML: sparse-data support, a model-evaluation framework. Performance testing.
  • Slide 22: Example: Logistic Regression (Python):

        data = spark.textFile(...).map(readPoint).cache()
        w = numpy.random.rand(D)

        for i in range(iterations):
            gradient = data \
                .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
                .reduce(lambda x, y: x + y)
            w -= gradient

        print "Final w: %s" % w
  • Slide 23: Fast: Logistic Regression Performance. [Chart: running time in seconds vs. number of iterations (1 to 30) for Hadoop and Spark. Hadoop costs roughly 110 s per iteration; Spark's first iteration takes 80 s and further iterations about 1 s each.]
  • Slide 24: Easy: Multi-language Support.

    Python:

        lines = sc.textFile(...)
        lines.filter(lambda s: "ERROR" in s).count()

    Scala:

        val lines = sc.textFile(...)
        lines.filter(x => x.contains("ERROR")).count()

    Java:

        JavaRDD<String> lines = sc.textFile(...);
        lines.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("error"); }
        }).count();
  • Slide 25: Easy: Interactive Shell. A Scala-based shell:

        % /opt/mapr/spark/spark-0.9.1/bin/spark-shell

        scala> val logs = sc.textFile("hdfs:///user/keys/logdata")

        scala> logs.count()
        res0: Long = 232681

        scala> logs.filter(l => l.contains("ERROR")).count()
        res1: Long = 205

    A Python-based shell, pyspark, is available as well.
  • Slide 26: Fault Tolerance and Performance.
  • Slide 27: Fast: Using RAM, Operator Graphs. In-memory caching: data partitions are read from RAM instead of disk. Operator graphs: scheduling optimizations and fault tolerance. [Diagram: a DAG of RDDs A through F flowing through map, join, filter, and groupBy operators, cut into Stages 1-3, with cached partitions marked.]
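
    A hedged sketch, not from the deck, tying the two points together: cached partitions are reused across actions, and narrow transformations pipeline within a stage while shuffles cut stage boundaries; the data and path are hypothetical:

        val events = sc.textFile("hdfs:///data/events")
        val errors = events.filter(_.contains("ERROR")).cache()

        // First action: reads from disk and caches the filtered partitions in RAM.
        errors.count()

        // Later actions reuse the cached partitions instead of rereading disk.
        errors.map(_.length).reduce(_ + _)

        // map pipelines with the cached filter inside one stage; groupByKey
        // forces a shuffle, so the scheduler starts a new stage there
        // (like Stages 1-3 in the slide's diagram).
        val byKey = errors.map(line => (line.split(" ")(0), line)).groupByKey()
        byKey.count()
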