Apache Spark - History and market overview


Page 1: Apache spark - History and market overview

Apache Spark

History and market overview

Martin Zapletal Cake Solutions

Page 2: Apache spark - History and market overview

Apache Spark and Big Data

1) History and market overview

2) Installation

3) MLlib and machine learning on Spark

4) Porting R code to Scala and Spark

5) Concepts - Core, SQL, GraphX, Streaming

6) Spark’s distributed programming model

7) Deployment

Page 3: Apache spark - History and market overview

Table of contents

● Motivation - why distributed data processing

● Market overview

● Brief history

● Hadoop MapReduce

● Apache Spark

● Other competitors

● Q & A

Page 4: Apache spark - History and market overview


● data production in 2002: around 5 exabytes, or roughly 800 megabytes per person; even more flowed through TV, radio, and phone

● doubled from 1999

● importance of data for business and society

● data must be stored, processed, and analysed to extract its value

● 3Vs of data

o Volume

o Velocity

o Variety
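As a rough consistency check on the first bullet of this slide, dividing the total volume by the per-person figure gives the implied population. This is only a sketch and assumes decimal units (1 exabyte = 10^18 bytes, 1 megabyte = 10^6 bytes):

```latex
\frac{5 \times 10^{18}\ \text{bytes}}{800 \times 10^{6}\ \text{bytes per person}}
  = 6.25 \times 10^{9}\ \text{people}
```

That is in line with a world population of a little over six billion in 2002, so the two figures on the slide agree with each other.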

Page 5: Apache spark - History and market overview

Distributed computing

● from supercomputers to cloud

o economical reasons

o gradual upgrades

o fault tolerance

o scalability

o versatility

o development speed

o ecosystem and tooling

o geographical distribution

o various models and technologies

Page 6: Apache spark - History and market overview

Distributed computing

● largest Yahoo Hadoop cluster has 4,500 nodes. 40,000 nodes in total. 455


● Facebook Hadoop 2000 nodes, each 12TB storage, 32GB RAM, 8-16


● Yahoo Kafka 20 gigabytes/second, LinkedIn 460,000 writes/sec,

2,300,000 reads/sec

● MongoDB 100 nodes, 20-30TB

Page 7: Apache spark - History and market overview

Distributed computing

● need for new tools, approaches, philosophy, languages, theory

● the 8 fallacies of distributed computing

o e.g. the network is reliable, latency is zero, the network is secure

● complexity

o packet loss, ordering, acknowledgement, time, synchronization,

reliable delivery

o many possible states and outcomes

o ubiquitous failures and the impact of distribution

● deployment

● theory
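The complexity bullet above (packet loss, acknowledgement, reliable delivery) can be made concrete with a small simulation. The sketch below is not taken from any real messaging library: `Receiver`, `LossyChannel`, `Fault`, and `sendReliably` are hypothetical names, and the "network" is a scripted list of faults so the run is deterministic. It shows one common approach, retrying until an acknowledgement arrives and de-duplicating by message id on the receiving side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class ReliableDeliverySketch {

    /** Scripted network faults (hypothetical, for illustration only). */
    enum Fault { LOSE_MESSAGE, LOSE_ACK, NONE }

    /** Remembers message ids so that a retransmission is not applied twice. */
    static class Receiver {
        private final Set<Integer> seenIds = new HashSet<>();
        final List<String> delivered = new ArrayList<>();

        /** Applies each id at most once; the return value stands for the acknowledgement. */
        boolean receive(int id, String payload) {
            if (seenIds.add(id)) {
                delivered.add(payload);
            }
            return true;
        }
    }

    /** A channel that loses messages or acknowledgements according to a script. */
    static class LossyChannel {
        private final Iterator<Fault> script;
        private final Receiver receiver;
        int attempts = 0;

        LossyChannel(List<Fault> script, Receiver receiver) {
            this.script = script.iterator();
            this.receiver = receiver;
        }

        /** Returns true only if the message arrived and its acknowledgement made it back. */
        boolean send(int id, String payload) {
            attempts++;
            Fault fault = script.hasNext() ? script.next() : Fault.NONE;
            if (fault == Fault.LOSE_MESSAGE) {
                return false;                       // never reached the receiver
            }
            boolean ack = receiver.receive(id, payload);
            return ack && fault != Fault.LOSE_ACK;  // the ack itself may be lost
        }
    }

    /** Retries until the send is acknowledged or maxAttempts is used up. */
    static boolean sendReliably(LossyChannel channel, int id, String payload, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (channel.send(id, payload)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Receiver receiver = new Receiver();
        LossyChannel channel = new LossyChannel(
                Arrays.asList(Fault.LOSE_MESSAGE, Fault.LOSE_ACK), receiver);
        boolean ok = sendReliably(channel, 1, "hello", 5);
        // prints: true after 3 attempts, delivered=[hello]
        System.out.println(ok + " after " + channel.attempts
                + " attempts, delivered=" + receiver.delivered);
    }
}
```

The second attempt is the interesting one: the message arrives but its acknowledgement is lost, so the sender cannot tell the difference from a lost message and must resend. Without the id check in `Receiver`, "hello" would be applied twice.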

Page 8: Apache spark - History and market overview

Big Data technologies

● distributed computing frameworks

o batch

o stream

● machine learning and data mining

● support tools

● message queues

● databases

● distributed computing primitives

● cluster operating systems, schedulers

● deployment tools

Page 9: Apache spark - History and market overview

Big Data technologies

Page 10: Apache spark - History and market overview

Distributing computation

● efficient use of resources

● ensuring the computation completes

● ensuring correct result

● different levels of abstraction

o GPU

o processes

o threads

o actors

o actor clusters and virtualized actors

o frameworks on top of actors

o distributed computing frameworks

● different computing models

o shared nothing

o shared memory

o actors

o MapReduce
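To make the last of these computing models more tangible, here is a minimal single-process sketch of the three MapReduce phases (map, shuffle, reduce) in plain Java. It is an illustration only, not Hadoop's API: the class and method names are hypothetical, and the "partitions" are just in-memory lists standing in for data held on separate nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapShuffleReduceSketch {

    /** Map phase: each partition is processed on its own, sharing nothing with the others. */
    static List<Map.Entry<String, Integer>> mapPartition(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    /** Shuffle phase: bring all values for the same key together. */
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : mapped) {
            for (Map.Entry<String, Integer> pair : partition) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reduce phase: fold the values of each key into a single result. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    /** Runs the three phases in order over the given partitions. */
    static Map<String, Integer> wordCount(List<List<String>> partitions) {
        List<List<Map.Entry<String, Integer>>> mapped = new ArrayList<>();
        for (List<String> partition : partitions) {
            mapped.add(mapPartition(partition));
        }
        return reduce(shuffle(mapped));
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("to be or not"),
                Arrays.asList("to be"));
        // prints: {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(partitions));
    }
}
```

In a real cluster the map calls would run in parallel on different machines and the shuffle would move data over the network, which is where most of the cost and the failure handling comes in.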

Page 11: Apache spark - History and market overview

Distributing computation

[Diagram: timeline t1 to t6; legend: Data, Network, Computation]

Page 12: Apache spark - History and market overview

Distributing computation

[Diagram: timeline t1 to t3]

Page 13: Apache spark - History and market overview

Distributing computation

[Diagram: timeline t1 to t3]

Page 14: Apache spark - History and market overview

Brief history

● Google File System 2003

● MapReduce 2004

● BigTable 2006

● Dremel 2010

● Colossus 2011

● Spanner 2012

● Amazon Dynamo 2007

Page 15: Apache spark - History and market overview

Brief history

● Apache Hadoop

o HDFS file system

o HBase database

o MapReduce

o Apache Mahout

o Apache Hive

o Apache Pig

o Apache Drill

o YARN resource management etc.

Page 16: Apache spark - History and market overview

Hadoop MapReduce

Page 17: Apache spark - History and market overview

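Hadoop MapReduce

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Job setup: args[0] is the input path, args[1] the output path
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
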

Page 18: Apache spark - History and market overview

Apache Spark

● developed at UC Berkeley, now open source

● written in Scala, uses Akka

● compatible with existing Hadoop infrastructure

● APIs for Java, Scala, and Python

● simple, expressive, functional, and high-level programming model

● speed

● in-memory caching, query optimizations

● suitable for iterative and ad-hoc queries (ideal for ML)

● used in production at Yahoo, Amazon, ...

● Databricks raised ~$47M in the last year

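// `spark` is an existing SparkContext
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
                 .map(word => (word, 1))             // pair each word with a count of 1
                 .reduceByKey(_ + _)                 // sum the counts per word
counts.saveAsTextFile("hdfs://...")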
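The bullets on in-memory caching and iterative queries can be illustrated without Spark itself. The sketch below is plain Java and makes no claim about how Spark implements caching: `Dataset`, `loads`, and `estimateMean` are hypothetical names. It runs a simple iterative job (gradient descent towards the mean of the data) once against a dataset that keeps its contents in memory and once against one that re-reads its source on every pass, and counts how often the source is read.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class CachingSketch {

    /** Hypothetical dataset wrapper: loads lazily and optionally keeps the result in memory. */
    static class Dataset {
        private final Supplier<List<Double>> loader;
        private final boolean cacheEnabled;
        private List<Double> cached;
        int loads = 0;   // how many times the (expensive) source was read

        Dataset(Supplier<List<Double>> loader, boolean cacheEnabled) {
            this.loader = loader;
            this.cacheEnabled = cacheEnabled;
        }

        List<Double> get() {
            if (cacheEnabled && cached != null) {
                return cached;              // served from memory
            }
            loads++;
            List<Double> data = loader.get();
            if (cacheEnabled) {
                cached = data;
            }
            return data;
        }
    }

    /** Iterative job: each iteration makes one full pass over the data. */
    static double estimateMean(Dataset dataset, int iterations) {
        double estimate = 0.0;
        for (int i = 0; i < iterations; i++) {
            List<Double> data = dataset.get();
            double gradient = 0.0;
            for (double x : data) {
                gradient += estimate - x;
            }
            estimate -= 0.5 * gradient / data.size();   // step halves the error each pass
        }
        return estimate;
    }

    public static void main(String[] args) {
        Supplier<List<Double>> source = () -> Arrays.asList(1.0, 2.0, 3.0, 4.0);
        Dataset cachedData = new Dataset(source, true);
        Dataset uncachedData = new Dataset(source, false);
        estimateMean(cachedData, 20);
        estimateMean(uncachedData, 20);
        // prints: cached loads=1, uncached loads=20
        System.out.println("cached loads=" + cachedData.loads
                + ", uncached loads=" + uncachedData.loads);
    }
}
```

When the source is a distributed file system rather than an in-memory list, those repeated reads dominate the running time, which is why iterative workloads such as machine learning benefit from keeping data in memory between passes.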


Page 19: Apache spark - History and market overview

Apache Spark


● deployment, installation, the programming model, and what actually happens in the background will be covered in the next talks

Page 20: Apache spark - History and market overview


● non-exhaustive list

● Akka cluster/remoting

o lower level abstraction

o more work for the developer

o more freedom

Page 21: Apache spark - History and market overview


● Intel GearPump

o built on top of Akka

o scalable, fault-tolerant, and expressive solution

o distributed streaming solution competing with, for example, Storm

Page 22: Apache spark - History and market overview


● Apache Flink

o written in Java, started in 2008 at the Technical University of Berlin, the Humboldt University of Berlin, and the Hasso Plattner Institute

o ASF Top-Level Project since early 2015

o fast

o cost-based query optimizer that generalizes relational database query optimizers to a distributed environment

o streaming

o API similar to Spark's

Page 23: Apache spark - History and market overview


● Apache Tez

o developed by Hortonworks, ASF Top-Level Project since July 2014

o generalizes MapReduce into a more powerful framework that expresses computations as a dataflow graph

o much richer API

o lower level than Spark or Flink, which allows some extra optimizations

Page 24: Apache spark - History and market overview


● Apache Samza

o developed at LinkedIn, joined ASF in September 2013

o distributed stream processing framework

o uses Kafka (also developed at LinkedIn) and other data sources

● Apache Storm

o distributed unbounded stream processing framework

o programming API to define graph topologies using Spouts (sources) and Bolts (processing nodes)

o used at Yahoo, Twitter, Yelp, Spotify, ...
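The idea of a topology built from sources and processing nodes can be sketched in a few lines of plain Java. This is deliberately not Storm's API: the `Spout` and `Bolt` interfaces below are simplified, hypothetical stand-ins, the chain is linear rather than a general graph, and the source is finite so the example terminates. It wires a sentence source into a splitting node and a running-count node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

public class TopologySketch {

    /** Hypothetical source of tuples; returns null once this finite demo source is drained. */
    interface Spout {
        String nextTuple();
    }

    /** Hypothetical processing node: consumes one tuple and emits zero or more downstream. */
    interface Bolt {
        void execute(String tuple, Consumer<String> emit);
    }

    static class SentenceSpout implements Spout {
        private final Iterator<String> sentences;

        SentenceSpout(List<String> sentences) {
            this.sentences = sentences.iterator();
        }

        public String nextTuple() {
            return sentences.hasNext() ? sentences.next() : null;
        }
    }

    /** Splits a sentence into words and emits each word separately. */
    static class SplitBolt implements Bolt {
        public void execute(String tuple, Consumer<String> emit) {
            for (String word : tuple.split(" ")) {
                if (!word.isEmpty()) {
                    emit.accept(word);
                }
            }
        }
    }

    /** Keeps a running count per word and emits the updated count after every tuple. */
    static class CountBolt implements Bolt {
        final Map<String, Integer> counts = new TreeMap<>();

        public void execute(String tuple, Consumer<String> emit) {
            counts.merge(tuple, 1, Integer::sum);
            emit.accept(tuple + "=" + counts.get(tuple));
        }
    }

    /** Feeds every tuple from the spout through the chain of bolts and into the sink. */
    static void run(Spout spout, List<Bolt> bolts, Consumer<String> sink) {
        String tuple;
        while ((tuple = spout.nextTuple()) != null) {
            push(tuple, bolts, 0, sink);
        }
    }

    private static void push(String tuple, List<Bolt> bolts, int index, Consumer<String> sink) {
        if (index == bolts.size()) {
            sink.accept(tuple);
            return;
        }
        bolts.get(index).execute(tuple, out -> push(out, bolts, index + 1, sink));
    }

    public static void main(String[] args) {
        List<String> results = new ArrayList<>();
        run(new SentenceSpout(Arrays.asList("to be", "to")),
                Arrays.asList(new SplitBolt(), new CountBolt()),
                results::add);
        // prints: [to=1, be=1, to=2]
        System.out.println(results);
    }
}
```

Unlike a batch job, results are emitted continuously as tuples arrive, which is why the count for "to" appears twice with different values.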

Page 25: Apache spark - History and market overview


● why distributed computing frameworks?

● why Spark?

o concepts based on theory

o young and progressive, written in Scala

o already mature and production proven

o distributed computing, Big Data, data analysis increasingly important

o potential to replace the market-leading MapReduce in the Hadoop ecosystem

● why not?

o many competitors

o Spark may not always be the best fit

