Apache Spark — Егор Пахомов


Description

We will talk about Apache Spark, a faster, more general-purpose, and more user-friendly counterpart to Hadoop as a tool for batch processing of big data. We will look at Spark's architecture and its main building block, the RDD, compare code written in the MapReduce and RDD models, and discuss the evolution of other Apache top-level projects and the industry's gradual shift from the MapReduce model toward Spark.

Transcript of Apache Spark — Егор Пахомов

1. (No text preserved.)

2. Apache Spark. Егор Пахомов, 30.08.2014.

3. About the speaker: works on an Apache Solr based search engine for e-commerce; Spark contributor.

4. (No text preserved.)

5. Hadoop.

6. HDFS architecture diagram: HDFS client, NameNode, Secondary NameNode, and DataNodes.

7. MapReduce.

8. Word count in the Hadoop MapReduce model (mapper and reducer):

       public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           @Override
           public void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               String line = value.toString();
               StringTokenizer tokenizer = new StringTokenizer(line);
               while (tokenizer.hasMoreTokens()) {
                   word.set(tokenizer.nextToken());
                   context.write(word, one);
               }
           }
       }

       public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
           @Override
           public void reduce(Text key, Iterable<IntWritable> values, Context context)
                   throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable val : values) {
                   sum += val.get();
               }
               context.write(key, new IntWritable(sum));
           }
       }

9. What Hadoop provides: fault tolerance and data locality.

10. MapReduce (rest of the slide text not preserved).

11. RDD. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012, April 2012.

12. Word count in the RDD model:

       val wordCounts = spark.textFile("/user/usr/in")
         .flatMap(line => line.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)

13. Whole program comparison, Hadoop MapReduce version:

       import java.io.IOException;
       import java.util.StringTokenizer;

       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapred.JobConf;
       import org.apache.hadoop.mapreduce.Job;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;
       import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

       public class HadoopAnalyzer {
           public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
               private final static IntWritable one = new IntWritable(1);
               private Text word = new Text();

               @Override
               public void map(LongWritable key, Text value, Context context)
                       throws IOException, InterruptedException {
                   String line = value.toString();
                   StringTokenizer tokenizer = new StringTokenizer(line);
                   while (tokenizer.hasMoreTokens()) {
                       word.set(tokenizer.nextToken());
                       context.write(word, one);
                   }
               }
           }

           public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
               @Override
               public void reduce(Text key, Iterable<IntWritable> values, Context context)
                       throws IOException, InterruptedException {
                   int sum = 0;
                   for (IntWritable val : values) {
                       sum += val.get();
                   }
                   context.write(key, new IntWritable(sum));
               }
           }

           public static void main(String[] args) throws Exception {
               JobConf conf = new JobConf(HadoopAnalyzer.class);
               conf.setJobName("wordcount");
               System.out.println("Executing job.");
               Job job = new Job(conf, "job");
               job.setJarByClass(HadoopAnalyzer.class);
               job.setInputFormatClass(TextInputFormat.class);
               job.setOutputFormatClass(TextOutputFormat.class);
               TextInputFormat.addInputPath(job, new Path("/user/usr/in"));
               TextOutputFormat.setOutputPath(job, new Path("/user/usr/out"));
               job.setMapperClass(Map.class);
               job.setReducerClass(Reduce.class);
               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(IntWritable.class);
               job.waitForCompletion(true);
               System.out.println("Done.");
           }
       }

14. Whole program comparison, Spark version:

       package ru.yandex.spark.examples

       import org.apache.spark.{SparkConf, SparkContext}

       object WordCount {
         def main(args: Array[String]) {
           val conf = new SparkConf()
           conf.setAppName("serp-api")
           conf.setMaster("yarn-client")
           val spark = new SparkContext(conf)
           spark.addJar(SparkContext.jarOfClass(this.getClass).get)
           val wordCounts = spark.textFile("/user/usr/in")
             .flatMap(line => line.split(" "))
             .map(word => (word, 1))
             .reduceByKey(_ + _)
           wordCounts.saveAsTextFile("/user/usr/out")
           spark.stop()
         }
       }
15. RDD API, actions versus transformations (a hedged sketch of lazy evaluation and caching follows the transcript):

       Actions
         def collect(): Array[T]
         def saveAsTextFile(path: String)
         def toLocalIterator: Iterator[T]

       Transformations in one partition (narrow)
         def filter(f: T => Boolean): RDD[T]
         def map[U: ClassTag](f: T => U): RDD[U]
         def foreachPartition(f: Iterator[T] => Unit)
         def zipWithIndex(): RDD[(T, Long)]

       Cross-partition transformations (wide)
         def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
         def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

16. Benchmark chart: Hadoop versus Spark over 1, 5, 10, 20, and 30 iterations (y axis from 500 to 4000; units not preserved).

17. Word count in Python and Scala:

       Python:
         file = spark.textFile("hdfs://...")
         counts = file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
         counts.saveAsTextFile("hdfs://...")

       Scala:
         val file = spark.textFile("hdfs://...")
         val counts = file.flatMap(line => line.split(" "))
           .map(word => (word, 1))
           .reduceByKey(_ + _)
         counts.saveAsTextFile("hdfs://...")

18. Word count in Java 8:

       JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
       JavaRDD<String> words =
           lines.flatMap(line -> Arrays.asList(line.split(" ")));
       JavaPairRDD<String, Integer> counts =
           words.mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((x, y) -> x + y);
       counts.saveAsTextFile("hdfs://counts.txt");

19. Word count in Java 7:

       JavaRDD<String> file = spark.textFile("hdfs://...");
       JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
           public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
       });
       JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
           public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
       });
       JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
           public Integer call(Integer a, Integer b) { return a + b; }
       });
       counts.saveAsTextFile("hdfs://...");

20. SQL on Hadoop versus SQL on Spark: Spark SQL (a hedged Spark SQL sketch follows the transcript).

21. (No text preserved.)

22. Hive with Spark as the execution engine.

23. Machine learning, Hadoop versus Spark (SparkML): roughly 24 algorithms versus roughly 13.

24. MLlib k-means example:

       import org.apache.spark.mllib.clustering.KMeans
       import org.apache.spark.mllib.linalg.Vectors

       // Load and parse the data
       val data = sc.textFile("data/kmeans_data.txt")
       val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

       // Cluster the data into two classes using KMeans
       val numClusters = 2
       val numIterations = 20
       val clusters = KMeans.train(parsedData, numClusters, numIterations)

       // Evaluate clustering by computing Within Set Sum of Squared Errors
       val WSSSE = clusters.computeCost(parsedData)
       println("Within Set Sum of Squared Errors = " + WSSSE)

25. MLlib algorithms:
      Classification and regression: linear support vector machine (SVM), logistic regression, linear least squares, Lasso and ridge regression, decision tree, naive Bayes.
      Collaborative filtering: alternating least squares (ALS).

26. MLlib algorithms, continued:
      Clustering: k-means.
      Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA).
      Optimization: stochastic gradient descent, limited-memory BFGS (L-BFGS).

27. Mahout and Spark.

28. Graph processing: Hadoop versus Spark.

29. Streaming and batch processing: Hadoop versus Spark (a hedged Spark Streaming sketch follows the transcript).

30. Streaming as batch processing. Diagram: Spark Streaming cuts the input data stream into batches of input data, and the Spark engine turns them into batches of processed data.

31. (No text preserved.)

32. (No text preserved.)

33. Big Data stack diagram: batch, interactive, and streaming workloads running over Hadoop, Cassandra, Mesos, and cloud provider APIs.

34. Charts comparing project activity for Spark, Storm, HDFS, YARN, and MapReduce (numeric axes not preserved).

35. (No text preserved.)

36. Contact: epahomov@yandex-team.ru
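
The action and transformation lists on slide 15 and the iteration benchmark on slide 16 rest on two properties of RDDs that the slides only name: transformations are lazy, and an RDD can be cached in memory between actions. The sketch below is not from the talk; it is a minimal illustration of those two properties, and the local master, input path, and object name are assumptions made for the example.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair RDD implicits (needed on Spark versions before 1.3)

    object LazyEvaluationSketch {
      def main(args: Array[String]): Unit = {
        // Local master and input path are placeholders for the sketch, not values from the talk.
        val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // Transformations (flatMap, map, reduceByKey) only record lineage; nothing runs yet.
        val wordCounts = sc.textFile("/tmp/words.txt")
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .cache() // keep the computed result in memory for reuse

        // Actions trigger execution of the recorded lineage.
        println(wordCounts.count())                  // first action: reads the file and computes the counts
        println(wordCounts.filter(_._2 > 1).count()) // second action: served from the cached RDD

        sc.stop()
      }
    }

Reusing a cached RDD across repeated actions is what an iterative benchmark like the one on slide 16 exercises; in the plain MapReduce model, each iteration goes back to HDFS.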
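
Slides 20 and 22 mention Spark SQL and Hive on Spark only by name. The sketch below is a hedged illustration of querying data through Spark SQL with the Spark 1.x API of that period (SQLContext and SchemaRDD; registerTempTable follows Spark 1.1, earlier releases used registerAsTable). The Person schema, the file path, and the query are invented for the example.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Schema for the example data; not from the talk.
    case class Person(name: String, age: Int)

    object SparkSqlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("spark-sql-sketch").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD // implicit conversion from an RDD of case classes to a SchemaRDD

        // Build an RDD of Person records from a hypothetical CSV file and register it as a table.
        val people = sc.textFile("/tmp/people.txt")
          .map(_.split(","))
          .map(fields => Person(fields(0), fields(1).trim.toInt))
        people.registerTempTable("people")

        // Run SQL over the registered table; the result is again an RDD of rows.
        val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
        teenagers.map(row => "Name: " + row(0)).collect().foreach(println)

        sc.stop()
      }
    }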
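
Slides 29 and 30 describe Spark Streaming's micro-batch model only as a diagram: the input stream is cut into small batches, and each batch is processed by the regular Spark engine. The sketch below is a minimal, hedged word count over a socket stream; the host, port, and one-second batch interval are assumptions made for the example, not values from the talk.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair DStream implicits (needed on Spark versions before 1.3)

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
        // Cut the input stream into one-second micro-batches, as in the slide 30 diagram.
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical source: lines of text arriving on a local socket (for example, started with `nc -lk 9999`).
        val lines = ssc.socketTextStream("localhost", 9999)

        // The same word-count pipeline as in the batch examples, applied to every micro-batch.
        val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
        counts.print() // each batch of results is produced by the ordinary Spark engine

        ssc.start()
        ssc.awaitTermination()
      }
    }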