The How and Why of Fast Data Analytics with Apache Spark


with Justin Pihony @JustinPihony

Today’s agenda:
▪ Concerns
▪ Why Spark?
▪ Spark basics
▪ Common pitfalls
▪ We can help!

Target Audience


Concerns
▪ Am I too small?
▪ Will switching be too costly?
▪ Can I utilize my current infrastructure?
▪ Will I be able to find developers?
▪ Are there enough resources available?

Why Spark?


grep?

Why Spark?


import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))          // read the input file as an RDD of lines
      .flatMap(_.split(" "))      // split each line into words
      .map((_, 1))                // pair every word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word across the cluster
      .saveAsTextFile(args(1))    // write the word counts back out
  }
}

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the emitted counts for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Tiny Code vs. Big Code

Why Spark?


▪ Readability
▪ Expressiveness
▪ Fast
▪ Testability (see the sketch below)
▪ Interactive
▪ Fault Tolerant
▪ Unify Big Data
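One reason the speed and testability points hold up in day-to-day work is that the same code can run against a local master, so a job can be exercised as an ordinary in-JVM test without a cluster. A minimal sketch of that idea (ours, not from the talk; the object name and assertion are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountTest {
  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside this JVM with two threads; no cluster required.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("wordcount-test"))
    try {
      val counts = sc.parallelize(Seq("a b", "b"))
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("a") == 1 && counts("b") == 2)
    } finally {
      sc.stop()
    }
  }
}

Because nothing in the code ties it to a particular cluster manager, the identical program can later be submitted to a real cluster unchanged.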


The MapReduce Explosion


“Spark will kill MapReduce, but save Hadoop.”

- http://insidebigdata.com/2015/12/08/big-data-industry-predictions-2016/

Big Data Unified API


▪ Spark Core
▪ Spark SQL
▪ Spark Streaming
▪ MLlib (machine learning)
▪ GraphX (graph)
▪ DataFrames
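To make "unified" concrete, the sketch below (our own illustration, written against roughly the Spark 1.x API that was current for this talk; the Purchase data is invented) starts from a plain RDD and then queries the same data through DataFrames and Spark SQL in a single program:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Purchase(user: String, amount: Double)

object UnifiedApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unified-api").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Spark Core: an ordinary RDD of case classes.
    val purchases = sc.parallelize(Seq(Purchase("ann", 12.5), Purchase("bob", 3.0), Purchase("ann", 7.5)))

    // DataFrames: the same data with a schema and a relational-style API.
    val df = purchases.toDF()
    df.groupBy("user").sum("amount").show()

    // Spark SQL: query it with plain SQL, still in the same program.
    df.registerTempTable("purchases")
    sqlContext.sql("SELECT user, SUM(amount) AS total FROM purchases GROUP BY user").show()

    sc.stop()
  }
}

Spark Streaming, MLlib and GraphX build on the same core engine in the same way, which is the point of the unified API.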


Who Is Using Spark?

Yahoo!

Spark Mechanics


[Diagram: a Driver program coordinating several Worker nodes]

Spark Mechanics


[Diagram: the Driver hosts the SparkContext, which coordinates the Worker nodes]

Spark Context


▪ Task creator
▪ Scheduler
▪ Data locality
▪ Fault tolerance

RDD


▪ Resilient Distributed Dataset
▪ Transformations (see the sketch below)
  - map
  - filter
  - …
▪ Actions
  - collect
  - count
  - reduce
  - …
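To make the split concrete, here is a small sketch (our own example with made-up data): transformations such as filter and map only describe a computation and return a new RDD lazily, while actions such as count or take are what actually run work on the cluster. It assumes a SparkContext named sc, like the one in the word count example above or the one the interactive spark-shell provides.

// Transformations are lazy: these lines only build up a lineage, nothing runs yet.
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Actions trigger the actual computation and return results to the driver.
val howMany  = squares.count()   // 500000
val firstFew = squares.take(5)   // Array(4, 16, 36, 64, 100)

Because every RDD remembers the lineage of transformations that produced it, a lost partition can simply be recomputed from its parents, which is where the fault tolerance mentioned earlier comes from.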

Expressive and Interactive


Built-in UI


Common Pitfalls

▪ Functional
▪ Out of memory (see the sketch below)
▪ Debugging
▪ …
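As one concrete illustration of the out-of-memory entry (our example; the HDFS path is made up): collect() pulls an entire RDD back into the driver JVM, which is fine while prototyping on small files and a classic way to exhaust the driver's heap on production-sized data.

// Risky: materializes every line of a potentially huge file in the driver's memory.
val everything = sc.textFile("hdfs:///data/big.log").collect()

// Safer: keep the heavy lifting on the cluster and bring back only what you need.
val lineCount = sc.textFile("hdfs:///data/big.log").count()
val sample    = sc.textFile("hdfs:///data/big.log").take(10)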


Concerns
▪ Am I too small?
▪ Will switching from MapReduce be too costly?
▪ Can I utilize my current infrastructure?
▪ Will I be able to find developers?
▪ Are there enough resources available?

Q & A


EXPERT SUPPORT: Why Contact Typesafe for Your Apache Spark Project?

Ignite your Spark project with 24/7 production SLA, unlimited expert support and on-site training:

• Full application lifecycle support for Spark Core, Spark SQL & Spark Streaming
• Deployment to Standalone, EC2, Mesos clusters
• Expert support from dedicated Spark team
• Optional 10-day “getting started” services package

Typesafe is a partner with Databricks, Mesosphere and IBM.

Learn more about on-site training. CONTACT US.

©Typesafe 2016 – All Rights Reserved