The How and Why of Fast Data Analytics with Apache Spark

Transcript of The How and Why of Fast Data Analytics with Apache Spark

Page 1: The How and Why of Fast Data Analytics with Apache Spark

The How and Why of Fast Data Analytics with Apache Spark

with Justin Pihony @JustinPihony

Page 2: The How and Why of Fast Data Analytics with Apache Spark

Today’s agenda:

▪ Concerns
▪ Why Spark?
▪ Spark basics
▪ Common pitfalls
▪ We can help!


Page 3: The How and Why of Fast Data Analytics with Apache Spark

Target Audience


Page 4: The How and Why of Fast Data Analytics with Apache Spark

Concerns

▪ Am I too small?
▪ Will switching be too costly?
▪ Can I utilize my current infrastructure?
▪ Will I be able to find developers?
▪ Are there enough resources available?

Page 5: The How and Why of Fast Data Analytics with Apache Spark

Why Spark?


grep?

Page 6: The How and Why of Fast Data Analytics with Apache Spark

Why Spark?


Page 7: The How and Why of Fast Data Analytics with Apache Spark

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]) {

    val conf = new SparkConf()
      .setAppName("wordcount")

    val sc = new SparkContext(conf)

    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // countByValue is an action that returns a Map to the driver, so the counts are built with reduceByKey instead
      .saveAsTextFile(args(1))
  }
}


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

Why Spark? Tiny code (Spark, above) vs. big code (classic Hadoop MapReduce, above).

Page 8: The How and Why of Fast Data Analytics with Apache Spark

Why Spark?


▪ Readability
▪ Expressiveness
▪ Fast
▪ Testability
▪ Interactive
▪ Fault Tolerant
▪ Unify Big Data

Page 9: The How and Why of Fast Data Analytics with Apache Spark


The MapReduce Explosion

Page 10: The How and Why of Fast Data Analytics with Apache Spark


Page 11: The How and Why of Fast Data Analytics with Apache Spark

“Spark will kill MapReduce, but save Hadoop.”

- http://insidebigdata.com/2015/12/08/big-data-industry-predictions-2016/

Page 12: The How and Why of Fast Data Analytics with Apache Spark
Page 13: The How and Why of Fast Data Analytics with Apache Spark

Big Data Unified API

▪ Spark Core
▪ Spark SQL
▪ Spark Streaming
▪ MLlib (machine learning)
▪ GraphX (graph)
▪ DataFrames
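
The payoff of this unified API is that one application can mix these components freely. Below is a minimal sketch, assuming the Spark 1.6-era API of this talk (the object name and column name are illustrative): a Spark Core RDD flows straight into Spark SQL's DataFrames on the same context.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object UnifiedWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("unified"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Spark Core: an ordinary RDD of words
    val words = sc.textFile(args(0)).flatMap(_.split(" "))

    // Spark SQL / DataFrames: the same data, queried relationally
    val df = words.toDF("word")
    df.groupBy("word").count().orderBy($"count".desc).show(10)
  }
}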

Page 14: The How and Why of Fast Data Analytics with Apache Spark

Who Is Using Spark?

Yahoo!

Page 15: The How and Why of Fast Data Analytics with Apache Spark

Spark Mechanics

[Diagram: a Driver program coordinating three Workers]

Page 16: The How and Why of Fast Data Analytics with Apache Spark

Spark Mechanics

[Diagram: the Driver hosts the Spark Context, which coordinates three Workers]
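
To make the mechanics concrete, here is a minimal driver-side sketch (the app name and master URL are illustrative, assuming a standalone cluster): the SparkContext created in the driver is what connects to the cluster and farms work out to the workers.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("mechanics-demo")          // illustrative name
  .setMaster("spark://master-host:7077") // standalone cluster; use "local[*]" to run on one machine
val sc = new SparkContext(conf)

// The computation is declared on the driver but executed on the workers.
val doubledSum = sc.parallelize(1 to 100).map(_ * 2).sum()
println(doubledSum)
sc.stop()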

Page 17: The How and Why of Fast Data Analytics with Apache Spark

Spark Context

▪ Task creator
▪ Scheduler
▪ Data locality
▪ Fault tolerance (see the lineage sketch below)
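
On the fault-tolerance point, a small illustration (the values are arbitrary; sc is assumed to be a live SparkContext, as in spark-shell): every RDD remembers the chain of transformations that produced it, so a lost partition can be recomputed from its parents rather than restored from a replica.

val base    = sc.parallelize(1 to 10)
val derived = base.map(_ * 2).filter(_ > 5)

// Each RDD records how it was derived from its parents,
// which is what lets Spark rebuild lost partitions.
println(derived.toDebugString)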

Page 18: The How and Why of Fast Data Analytics with Apache Spark

RDD

▪ Resilient Distributed Dataset
▪ Transformations (lazy)
  - map
  - filter
  - …
▪ Actions (trigger execution; see the sketch below)
  - collect
  - count
  - reduce
  - …
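
A minimal sketch of the transformation/action split (the log path is hypothetical; sc is assumed to be a live SparkContext): transformations only describe the computation, and nothing runs until an action asks for a result.

// Transformations are lazy: these lines build a plan, they don't read any data yet.
val lines  = sc.textFile("hdfs:///logs/app.log") // hypothetical path
val errors = lines.filter(_.contains("ERROR"))
val words  = errors.flatMap(_.split(" "))

// Actions trigger the actual job.
val errorCount = errors.count() // runs the filter over the whole file
val sample     = words.take(5)  // runs again, only far enough to get 5 elements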

Page 19: The How and Why of Fast Data Analytics with Apache Spark

Expressive and Interactive

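As an illustration of the interactive side, this is roughly what a spark-shell exchange might look like (the RDD type printout is abbreviated, and the shell provides a ready-made SparkContext named sc):

scala> val nums = sc.parallelize(1 to 1000000)
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:...

scala> nums.filter(_ % 2 == 0).count()
res0: Long = 500000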

Page 20: The How and Why of Fast Data Analytics with Apache Spark

Built-in UI


Page 21: The How and Why of Fast Data Analytics with Apache Spark

Common Pitfalls

▪ Functional
▪ Out of memory (see the sketch below)
▪ Debugging
▪ …
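
On the out-of-memory pitfall, a minimal sketch (the paths are hypothetical; sc is assumed to be a live SparkContext): collect() ships the entire RDD to the driver, which is a classic way to exhaust driver memory on real datasets. Prefer sampling, or keep results distributed.

val big = sc.textFile("hdfs:///huge-dataset") // hypothetical path

// Risky: collect() materializes every record in the driver's memory.
// val everything = big.collect()

// Safer alternatives: inspect a small sample, or write results back out.
big.take(10).foreach(println)
big.saveAsTextFile("hdfs:///huge-dataset-out") // hypothetical path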


Page 22: The How and Why of Fast Data Analytics with Apache Spark

Concerns

▪ Am I too small?
▪ Will switching from MapReduce be too costly?
▪ Can I utilize my current infrastructure?
▪ Will I be able to find developers?
▪ Are there enough resources available?

Page 23: The How and Why of Fast Data Analytics with Apache Spark

Q & A


Page 24: The How and Why of Fast Data Analytics with Apache Spark

EXPERT SUPPORT: Why Contact Typesafe for Your Apache Spark Project?

Ignite your Spark project with 24/7 production SLA, unlimited expert support and on-site training:

• Full application lifecycle support for Spark Core, Spark SQL & Spark Streaming
• Deployment to Standalone, EC2, Mesos clusters
• Expert support from dedicated Spark team
• Optional 10-day “getting started” services package

Typesafe is a partner with Databricks, Mesosphere and IBM.

Learn more about on-site training

CONTACT US

Page 25: The How and Why of Fast Data Analytics with Apache Spark

© Typesafe 2016 – All Rights Reserved