Evolution of Apache Spark: Journey of Spark in the 1.x series


● Madhukara Phatak

● Technical Lead at Tellius

● Consultant and Trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Agenda

● Spark 1.0
● State of Big data
● Change in ecosystem
● Dawn of structured data
● Working with structured sources
● Dawn of custom memory management
● Evolution of Libraries

Spark 1.0

● Released in May 2014 [1]
● First production-ready, backward-compatible release
● Contains
  ○ Spark batch
  ○ Spark Streaming
  ○ Shark
  ○ MLlib and GraphX
● Developed over 4 years
● Positioned as a better Hadoop

State of the Big Data Industry

● Map/Reduce was the standard way to do big data processing
● HDFS was the primary source of data
● Tools like Sqoop were developed to move data into HDFS, which acted as the single point of source
● All data was assumed to be unstructured by default, with structure laid on top of it
● Hive and Pig were popular ways to do structured and semi-structured data processing on top of Map/Reduce

Spark 1.0 Ideas

● The RDD abstraction supported Map/Reduce-style programming
● HDFS was the primary supported source, with memory as the speedup layer
● Spark Streaming was viewed as faster batch processing rather than as true streaming
● To support Hive, Shark was created to generate RDD code rather than Map/Reduce code

Changes from 2014

● The big data industry has gone through many radical changes in thinking in the last two years
● Some of those changes started in Spark, and others were influenced by other frameworks
● These changes are important for understanding why the Spark 2.0 abstractions are radically different from Spark 1.0
● Many of these were already discussed in earlier meetups; links to the videos are in the references

Dawn of Structured Data

Usage of Big Data in 2014

● Most people were using higher-level tools like Hive and Pig to process data, rather than raw Map/Reduce
● Most of the data resided in RDBMS databases, and users ETL'd data from MySQL to Hive in order to query it
● So a lot of use cases were about analysing structured data, contrary to the big data world's basic assumption of unstructured data
● Huge amounts of time were consumed by ETL and by non-optimized Hive workflows

Spark with Structured Data in 1.2

● Spark recognised the market's need for structured data and started to evolve the platform to support that use case
● The first attempt was a specialised RDD called SchemaRDD in Spark 1.2, which carried the schema along with the data
● But this approach was not clean
● Also, even though InputFormats existed for reading structured data, there was no direct API to read it from Spark

DataSource API in Spark 1.3

● The first API to provide a unified way to read from structured and semi-structured sources
● Can read from RDBMSs and from NoSQL databases like MongoDB, Cassandra etc.
● An advanced API, like InputFormat, which gives the source a lot of control to optimize data locality
● So in Spark 1.3, Spark addressed the need for structured data to be first class in the big data ecosystem (see the sketch below)
● For more info refer to the Anatomy of DataSource API talk [2]
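
A minimal sketch of reading through the DataSource API, written in the DataFrameReader style that arrived with Spark 1.4 (Spark 1.3 exposed the same sources through slightly different SQLContext methods). The file path, JDBC URL, table and credentials are placeholders, not values from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataSourceExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("datasource-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Semi-structured source: JSON files (path is a placeholder).
    val events = sqlContext.read.format("json").load("/data/events.json")

    // Structured source: an RDBMS over JDBC, through the same unified API
    // (URL, table and credentials are placeholders).
    val orders = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/sales")
      .option("dbtable", "orders")
      .option("user", "spark")
      .option("password", "secret")
      .load()

    events.printSchema()
    orders.printSchema()
    sc.stop()
  }
}
```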

DataFrame abstraction in Spark

● Spark understood that modifying the RDD abstraction was not good enough
● Many frameworks like Hive and Pig tried, and failed, to map querying efficiently onto Map/Reduce
● So Spark came up with the DataFrame abstraction, which goes through a completely different, highly optimized pipeline from that of RDDs (see the sketch below)
● For more info refer to the Anatomy of DataFrame API talk [3]
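
As an illustration of that separate pipeline, here is a hedged sketch of a DataFrame query, reusing the orders DataFrame from the previous sketch; the column names (status, order_date, amount) are assumptions. Because the operations are declarative, the optimizer can plan the whole query before execution, unlike opaque RDD functions.

```scala
import org.apache.spark.sql.functions._

// Declarative DataFrame operations; column names are hypothetical.
val dailyRevenue = orders
  .filter(col("status") === "COMPLETED")
  .groupBy(col("order_date"))
  .agg(sum(col("amount")).as("revenue"))

dailyRevenue.explain()   // prints the plan chosen by the engine
dailyRevenue.show()
```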

Evolution of In-Memory Processing

In-memory in Spark 1.0

● Spark was the first open source big data framework to embrace in-memory computing
● Cheaper hardware and abstractions like RDD allowed Spark to exploit memory more efficiently than all the other Hadoop ecosystem projects
● The first implementation of in-memory computing followed a typical cache approach of keeping serialized Java bytes (see the sketch below)
● This proved to be challenging later on
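
A minimal sketch of that cache-style approach, assuming the sc from the earlier sketch; the file path is a placeholder. MEMORY_ONLY_SER keeps the partitions as serialized Java bytes on the heap, which is the style of caching described above.

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/data/logs.txt")              // placeholder path
logs.persist(StorageLevel.MEMORY_ONLY_SER)            // serialized bytes on the JVM heap

println(logs.filter(_.contains("ERROR")).count())     // first action materializes the cache
println(logs.filter(_.contains("WARN")).count())      // second action is served from memory
```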

Challenges of in-memory in Java

● As more and more big data frameworks started to exploit memory, they soon realised a few limitations of the Java memory model
● Java memory management is tuned for short-lived objects, and complete control of memory is given to the JVM
● But once big data systems started using the JVM for long-term storage, the JVM memory model started to feel inadequate
● Also, as the Java heap grew in order to cache more data, GC pauses started to kill performance

Custom memory management

● Apache Flink was the first big data system to implement custom memory management in Java
● Flink follows a DataFrame-like API with a custom memory model
● The custom, non-GC-based memory model proved to be highly successful
● Observing the trends in the community, Spark also adopted the same approach in Spark 1.4

Tungsten in Spark 1.4

● Spark released the first version of custom memory management in the 1.4 release
● It was only supported for DataFrames, as they need the custom memory model
● Custom memory management greatly improved the use of Spark with larger VM sizes and far fewer GC pauses
● It solved the OOM issues which plagued earlier versions of Spark
● For more info refer to the Anatomy of In-memory Management in Spark talk [4]

DSLs for data processing

RDD and Map/Reduce APIs

● The RDD API of Spark follows a functional programming paradigm similar to Map/Reduce
● The RDD API passes around opaque function objects, which is great for programming but bad for system-level optimization (see the sketch below)
● The Java Map/Reduce API follows the same pattern, but is less elegant than the Scala one
● Hard to optimise compared to Pig/Hive
● So we saw a steady increase in custom DSLs in the Hadoop world
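
A minimal word-count sketch in the RDD style, assuming the sc from the earlier sketches; the path is a placeholder. The lambdas passed to flatMap, map and reduceByKey are opaque to the engine, which is exactly why this style is hard to optimize automatically.

```scala
val counts = sc.textFile("/data/docs.txt")
  .flatMap(line => line.split("\\s+"))   // opaque function: Spark cannot look inside it
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
```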

Need for DSLs in Hadoop

● DSLs like Pig or Hive are much easier to understand than the Java API
● They are less error prone and force you to be very specific
● They can be easily optimised, as a DSL only focuses on what to do, not how to do it
● Since Java Map/Reduce mixes the what with the how, it is hard to optimize compared to Hive and Pig
● So more and more people preferred these DSLs over the platform-level APIs

Challenges of DSLs in Hadoop

● The Hive and Pig DSLs do not integrate well with the Map/Reduce APIs
● DSLs often lack the flexibility of a complete programming language
● The Hive/Pig DSLs do not share a single common abstraction, so you cannot mix them
● DSLs are powerful for optimization but soon become limited in terms of functionality

Scala as a language to host DSLs

● Scala is one of the first languages to embrace DSLs as first class citizens
● Scala features like implicits, higher-order functions, structural types etc. make it easy to build DSLs and integrate them with the language (see the toy sketch below)
● This allows any Scala library to define a DSL and harness the full power of the language
● Many libraries outside big data define their own DSLs, e.g. Slick, Akka HTTP, sbt
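
A toy, self-contained sketch (not from the talk) of how implicits and higher-order functions let plain Scala read like a small query DSL; the Row type and the where/select verbs are made up for the example.

```scala
object TinyDsl {
  final case class Row(values: Map[String, Any])

  // The implicit class adds infix 'where' and 'select' verbs to any Seq[Row].
  implicit class RowOps(rows: Seq[Row]) {
    def where(p: Row => Boolean): Seq[Row] = rows.filter(p)
    def select(columns: String*): Seq[Map[String, Any]] =
      rows.map(r => r.values.filter { case (k, _) => columns.contains(k) })
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row(Map("name" -> "a", "amount" -> 10)),
      Row(Map("name" -> "b", "amount" -> 25))
    )
    // Reads close to a query language, but it is ordinary Scala code.
    val result = rows where (_.values("amount").asInstanceOf[Int] > 15) select "name"
    result.foreach(println)
  }
}
```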

DF DSL and Spark SQL DSL

● To harness the power of custom memory management and Hive-like optimizers, Spark encourages writing DataFrame and Spark SQL DSL code rather than raw RDD code
● Whenever we write this DSL, all the features of the Scala language and its libraries are available, which makes it more powerful than Pig/Hive
● Other frameworks like Flink and Beam follow the same ideas with Scala, Java 8 etc.
● You can easily mix and match the DSL with the RDD API (see the sketch below)
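
A hedged sketch of that mixing, reusing the sqlContext and the orders DataFrame from the earlier sketches (column names remain assumptions): SQL, the DataFrame DSL and the RDD API all operate on the same data.

```scala
// Expose the DataFrame to SQL (Spark 1.x API).
orders.registerTempTable("orders")

// Spark SQL DSL.
val bigOrders = sqlContext.sql(
  "SELECT order_date, amount FROM orders WHERE amount > 100")

// DataFrame DSL on the SQL result, then drop down to the RDD API when needed.
val totalsByDay = bigOrders.groupBy("order_date").sum("amount")
val asPairs = totalsByDay.rdd.map(row => (row.get(0), row.get(1)))

asPairs.take(5).foreach(println)
```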

Dataset DSL in Spark 1.6

● The DataFrame DSL was introduced in 1.4 and stabilised in 1.5
● As Spark observed the usability and performance benefits of DSL-based programming, it wanted to make the DSL an important pillar of Spark
● So in Spark 1.6, Spark released the Dataset DSL, which is poised to replace the RDD API in user land (see the sketch below)
● This indicates a big shift in thinking, as we move further and further away from the 1.0 Map/Reduce and unstructured-data mindset
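
A minimal Dataset sketch in the Spark 1.6 style, again hedged: the Order case class, its fields and the input path are assumptions. The point is that the code is typed like the RDD API but still goes through the optimized DataFrame pipeline.

```scala
case class Order(order_date: String, amount: Double)   // hypothetical schema

import sqlContext.implicits._                           // provides encoders for case classes

val typedOrders = sqlContext.read.json("/data/orders.json").as[Order]  // Dataset[Order]
val big = typedOrders.filter(o => o.amount > 100.0)                    // typed lambda, like RDDs
big.collect().foreach(println)
```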

Evolution of Libraries

Evolution of libraries vs frameworks

● Spark is one of the first big data frameworks to build a platform rather than a collection of frameworks
● A single abstraction results in multiple libraries, not multiple frameworks
● All these libraries benefit from improvements in the runtime
● This allowed Spark to build a large ecosystem in very little time
● To understand the meaning of platform, refer to the Introduction to Flink talk [5]

Data exchange between libraries

● As more and more libraries were added to Spark, having a common way to exchange data became important
● Initially the libraries used RDD as the data exchange format, but soon discovered some limitations
● Limitations of RDD as a data exchange format:
  ○ No defined schema, so each library needs to come up with its own domain objects
  ○ Too low level
  ○ Custom serialization is hard to integrate

DataFrame as the data exchange format

● Over the last few releases, Spark has been making DataFrame the new data exchange format of Spark
● A DataFrame has a schema and can easily be passed around between libraries (see the sketch below)
● DataFrame is a higher-level abstraction compared to RDD
● As DataFrames are serialized using platform-specific code generation, all libraries follow the same serialization
● Dataset will bring the same advantages
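
A hedged sketch of this exchange, reusing the orders DataFrame from the earlier sketches: a DataFrame produced by the SQL library is handed straight to a spark.ml algorithm, which itself returns DataFrames. The amount and quantity columns are assumptions and must be numeric.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

// Assemble numeric columns into a feature vector column (DataFrame in, DataFrame out).
val featurized = new VectorAssembler()
  .setInputCols(Array("amount", "quantity"))
  .setOutputCol("features")
  .transform(orders)

// spark.ml consumes and produces DataFrames as well.
val model = new KMeans().setK(3).setFeaturesCol("features").fit(featurized)
model.transform(featurized).show()
```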

Learnings from Spark 1.x

● Structured/semi-structured data is the first class citizen of big data processing systems
● Custom memory management and code-generated serialization give the best performance on the JVM
● DataFrame/Dataset are the new abstraction layers for building next-generation big data processing systems
● DSLs are the way forward over Map/Reduce-like APIs
● Having high-level structured abstractions makes libraries coexist happily on a platform