A Thorough Introduction to Spark #cwt2015
Uploaded by cloudera-japan
-
Cloudera, Inc. All rights reserved.
Cloudera World Tokyo 2015
Spark, Cloudera
-
Spark
-
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. - http://spark.apache.org
Spark vs. MapReduce: up to 100x faster in memory (10x on disk)
-
100
100x
-
Spark
[Diagram: platform stack. Storage: HDFS, HBase, Kudu. Resource management: YARN. Engines: Spark, Hadoop MapReduce, Impala, Search, others.]
-
MapReduce
-
MapReduce: Mapper tasks and Reduce tasks
[Diagram: a row of twelve Map tasks feeding four Reduce tasks]
-
Example: MapReduce
[Figure: an application takes web server access logs as input. Each line starts with a client IP address, for example:]
72.165.33.132 - - [04/Nov/
168.90.228.205 - - [04/Nov/
156.189.222.57 - - [04/Nov/
164.39.210.117 - - [04/Nov/
-
MapReduce example: Map phase
[Figure: the application runs four tasks over the access log lines]
-
MapReduce example: Reduce phase
Map output as (IP address, 1) pairs:
72.165.33.132, 1    72.165.33.132, 1    72.165.33.132, 1    72.165.33.145, 1
168.90.228.205, 1   168.90.228.205, 1   192.120.64.138, 1
156.189.222.57, 1   156.189.222.57, 1   164.219.215.208, 1
164.39.210.117, 1   164.39.210.117, 1   164.39.210.118, 1
Task
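The access-log example on these slides can be sketched in plain Python. This is only an illustration of the Map, shuffle, and Reduce steps, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_by_ip`) and the sample list `LOG_LINES` are assumptions made for this sketch, and the log lines are left truncated as on the slide.

```python
from collections import defaultdict

# Hypothetical sample input: access-log lines that begin with the client IP.
# The timestamps are truncated on the slide, so they stay truncated here.
LOG_LINES = [
    "72.165.33.132 - - [04/Nov/...",
    "72.165.33.132 - - [04/Nov/...",
    "168.90.228.205 - - [04/Nov/...",
    "72.165.33.145 - - [04/Nov/...",
]


def map_phase(line):
    """Emit one (ip, 1) pair for a log line."""
    ip = line.split(" ", 1)[0]
    yield ip, 1


def shuffle(pairs):
    """Group values by key, the step that sits between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one IP address."""
    return key, sum(values)


def count_by_ip(lines):
    """Run the three steps in order and return {ip: count}."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


result = count_by_ip(LOG_LINES)
print(result)
```

In a real cluster the Map and Reduce steps run as separate tasks on different machines, and the grouping step moves data across the network; here everything runs in one process.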
-
Hadoop MapReduce and Spark
-
Speed
-
MapReduce
[Diagram: a row of twelve Map tasks feeding four Reduce tasks]
-
[Diagrams: several MapReduce jobs chained together, each Map and Reduce stage feeding the next; in a later slide some intermediate stages are marked with an X]
-
: 18 3
64-128GB RAM
16 cores
50 GB per second
-
Directed acyclic graph (DAG)
[Diagram: RDDs A to F connected by map, groupBy, filter, join, and take operations. Legend: boxes are RDDs; shaded boxes are cached partitions.]
-
First iteration: MapReduce 110 sec, Spark 80 sec
-
Second iteration: MapReduce +110 sec (after 110 sec), Spark +1 sec (after 80 sec)
-
[Chart: running time in seconds (0 to 4000) against number of iterations (1, 5, 10, 20, 30) for MapReduce and Spark. MapReduce: 110 s per iteration. Spark: 80 s for the first iteration, 1 s for each later iteration.]
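The arithmetic behind the chart can be worked through with a few lines of Python. This is a rough sketch that assumes the per-iteration timings quoted on the slide hold exactly and that total time grows linearly; the function names `mapreduce_total` and `spark_total` are made up for the illustration.

```python
def mapreduce_total(iterations, per_iteration=110):
    """Total seconds if every iteration costs the same (110 s on the slide)."""
    return per_iteration * iterations


def spark_total(iterations, first=80, later=1):
    """Total seconds if the first iteration costs 80 s and each later one 1 s."""
    if iterations <= 0:
        return 0
    return first + later * (iterations - 1)


# The iteration counts used on the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(n, mapreduce_total(n), spark_total(n))
```

Under these assumptions, 30 iterations take 3300 s with MapReduce and 109 s with Spark, which is why the gap widens as the iteration count grows.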
-
API
Scala, Java, Python

Python:
  lines = sc.textFile(...)
  lines.filter(lambda s: "ERROR" in s).count()

Scala:
  val lines = sc.textFile(...)
  lines.filter(s => s.contains("ERROR")).count()

Java:
  JavaRDD<String> lines = sc.textFile(...);
  lines.filter(new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("ERROR"); }
  }).count();
-
percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...
scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count
...
res0: Long = 235886

scala>
-
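Word Count

MapReduce (Java):

  // Each public class below belongs in its own .java file.
  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Driver: configures the job and submits it.
  public class WordCount {
    public static void main(String[] args) throws Exception {
      Job job = new Job();
      job.setJarByClass(WordCount.class);
      job.setJobName("Word Count");
      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setMapperClass(WordMapper.class);
      job.setReducerClass(SumReducer.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(IntWritable.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      boolean success = job.waitForCompletion(true);
      System.exit(success ? 0 : 1);
    }
  }

  // Mapper: emits (word, 1) for every word in a line.
  public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      for (String word : line.split("\\W+")) {
        if (word.length() > 0) {
          context.write(new Text(word), new IntWritable(1));
        }
      }
    }
  }

  // Reducer: sums the counts for each word.
  public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int wordCount = 0;
      for (IntWritable value : values) {
        wordCount += value.get();
      }
      context.write(key, new IntWritable(wordCount));
    }
  }

Spark (Python):

  sc.textFile(file) \
    .flatMap(lambda s: s.split()) \
    .map(lambda w: (w, 1)) \
    .reduceByKey(lambda v1, v2: v1 + v2) \
    .saveAsTextFile(output)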
2-5x less code
-
RDD
Resilient Distributed Datasets (RDD)
-
File: purplecow.txt
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

RDD: mydata
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

RDD: mydata_uc
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  BUT I CAN TELL YOU, ANYHOW,
  I'D RATHER SEE THAN BE ONE.

RDD: mydata_filt
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  I'D RATHER SEE THAN BE ONE.

> mydata = sc.textFile("purplecow.txt")
> mydata_uc = mydata.map(lambda line: line.upper())
> mydata_filt = \
    mydata_uc.filter(lambda line: \
    line.startswith('I'))
> mydata_filt.count()
3
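The same pipeline can be imitated with Python generators to show why nothing is computed until `count()` is called. This is a plain-Python stand-in, not the RDD API: `text_file`, `read_log`, and `POEM` are names invented for this sketch, and unlike an RDD a generator can only be consumed once.

```python
# The four lines of purplecow.txt from the slide.
POEM = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

read_log = []  # records each line at the moment it is actually read


def text_file(lines):
    """Hypothetical stand-in for sc.textFile: yields lines lazily."""
    for line in lines:
        read_log.append(line)
        yield line


# Transformations: each step only describes work, nothing runs yet.
mydata = text_file(POEM)
mydata_uc = (line.upper() for line in mydata)
mydata_filt = (line for line in mydata_uc if line.startswith("I"))

lines_read_before_action = len(read_log)

# Action: counting pulls every line through the whole pipeline.
count = sum(1 for _ in mydata_filt)
lines_read_after_action = len(read_log)

print(lines_read_before_action, count, lines_read_after_action)
```

Defining the three stages reads no input at all; the read only happens when the final count drives the pipeline, which mirrors the split between transformations and actions in the slide's example.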
-
Hue Notebook
-
Jupyter/IPython Notebook
-
Spark SQL, MLlib, SparkR
-
Spark SQL
Spark/JavaSparkSQL
SparkSpark
SQLJavaScala
SQL (. )
-
MLlib: machine learning (ML) on Spark
-
Streaming
-
Hadoop MapReduce
-
Spark Streaming and the Spark API
[Diagram: live data arriving at t=0, t=1, t=2, t=3 is divided into a DStream, a sequence of RDDs: RDD @ t=1, RDD @ t=2, RDD @ t=3]
-
Spark Streaming
: 5
MLlib
-
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
[Diagram: for each batch (t, t+1, t+2) of the tweets DStream, flatMap produces the matching batch of the hashTags DStream, and each resulting batch is saved]
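The micro-batch idea in the diagram can be sketched without Spark: cut a timestamped stream into fixed intervals, then apply the same `flatMap`-style function to every batch. Everything here is an assumption made for illustration, including the `EVENTS` sample, the one-second interval, and `get_tags`, which stands in for the slide's `getTags` with a simple regular expression.

```python
import re
from collections import defaultdict

# Hypothetical (timestamp_seconds, text) events standing in for live tweets.
EVENTS = [
    (0.2, "learning #spark at #cwt2015"),
    (0.9, "no tags here"),
    (1.4, "#hadoop and #spark"),
    (2.7, "#spark streaming"),
]


def get_tags(status):
    """Hypothetical stand-in for getTags on the slide: extract #hashtags."""
    return re.findall(r"#\w+", status)


def to_batches(events, interval=1.0):
    """Group events into consecutive time windows, one list per batch.

    Windows that received no events are skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, text in events:
        batches[int(timestamp // interval)].append(text)
    return [batches[i] for i in sorted(batches)]


def flat_map(func, batch):
    """Apply func to each element and flatten the resulting lists."""
    return [item for element in batch for item in func(element)]


# The same transformation runs once per batch, as in the diagram.
hash_tags = [flat_map(get_tags, batch) for batch in to_batches(EVENTS)]
print(hash_tags)
```

Because each batch is handled like an ordinary collection, the per-batch code looks the same as batch-mode code, which is the point the slide makes about reusing the Spark API for streams.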
-
Spark: ICU
-
http://blog.cloudera.com/blog/2015/07/designing-fraud-detection-architecture-that-works-like-your-brain-does/
-
Spark and Hadoop
-
http://itpro.nikkeibp.co.jp/atcl/column/14/072800028/073000001/
MapReduce- Doug Cutting
-
http://www.cloudera.co.jp/blog/one-platform-initiative.html
Hadoop
No.
- Mike Olson
-
MapReduce
Hive, Pig Sqoop distcp
-
Spark and MapReduce
Stage 1: Crunch on Spark, Search on Spark
Stage 2: Hive on Spark (beta), Spark on HBase (beta)
Stage 3: Pig on Spark (alpha), Sqoop on Spark
Cloudera and Spark
-
MapReduce and Spark
-
Hadoop
Spark
Impala
Solr
MapReduceIO
:
-
Spark and Hadoop
[Diagram: platform stack. Libraries on Spark: Spark Streaming, MLlib, SparkSQL, GraphX, Data-frames, SparkR. Engines: Spark, Impala, MR, Search, others. Resource management: YARN. Storage: HDFS, HBase.]
-
Cloudera and Spark
[Timeline, 2013 to 2016; legible milestones: Spark with CDH 4.4, Spark on YARN]
-
Cloudera: Core Spark, Spark Streaming
ETL 20
Jaccard
ERP
(OCR)
(LDA)
1010
-
Apache Spark
Speed Easy Streaming
-
Cloudera
Apache Spark Spark & Hadoop I(New)
http://cloudera.co.jp/university
-
Spark
O'Reilly: Advanced Analytics with Spark (written by Clouderans)
Apache Hadoop
Cloudera Developer Blog
Cloudera Quick Start VM Spark http://codezine.jp/article/corner/583
-
Cloudera Live
cloudera.com/live
CDH
-
Cloudera
We are Hiring!