Evolution of Apache Spark
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Spark 1.0
● State of Big Data
● Change in ecosystem
● Dawn of structured data
● Working with structured sources
● Dawn of custom memory management
● Evolution of libraries
Spark 1.0
● Released in May 2014 [1]
● First production-ready, backward-compatible release
● Contains
○ Spark batch
○ Spark Streaming
○ Shark
○ MLlib and GraphX
● Developed over 4 years
● A better Hadoop
State of the Big Data Industry
● Map/Reduce was the way to do big data processing
● HDFS was the primary source of data
● Tools like Sqoop were developed for moving data to HDFS, and HDFS acted as the single point of source
● All data was by default assumed to be unstructured, with structure laid on top of it
● Hive and Pig were popular ways to do structured and semi-structured data processing on top of Map/Reduce
Spark 1.0 Ideas
● The RDD abstraction supported Map/Reduce-style programming
● HDFS was the primary supported source, with memory as the speedup layer
● Spark Streaming was viewed as faster batch processing rather than as true streaming
● To support Hive, Shark was created to generate RDD code rather than Map/Reduce
Changes since 2014
● The big data industry has gone through many radical changes in thinking in the last two years
● Some of those changes started in Spark, and others were influenced by other frameworks
● These changes are important to understand why the Spark 2.0 abstractions are radically different from Spark 1.0
● Many of these were already discussed in earlier meetups; links to the videos are in the references
Usage of Big Data in 2014
● Most people were using higher-level tools like Hive and Pig to process data, rather than Map/Reduce
● Most of the data resided in RDBMS databases, and users ETL'd data from MySQL to Hive to query it
● So a lot of use cases were about analysing structured data, contrary to the big data world's basic assumption of unstructured data
● A huge amount of time was consumed by ETL and non-optimized workflows in Hive
Spark with Structured Data in 1.2
● Spark recognised the need for structured data in the market and started to evolve the platform to support that use case
● The first attempt was a specialised RDD called SchemaRDD in Spark 1.2, which carried the schema alongside the data
● But this approach was not clean
● Also, even though there were InputFormats for reading structured data, there was no direct API in Spark to read from those sources
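The SchemaRDD approach looked roughly like the sketch below, following the shape of the Spark 1.2 SQL programming guide; the case class, file, and table names are illustrative, and `sc` is an existing SparkContext:

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit: RDD[Person] -> SchemaRDD

// Build a SchemaRDD from an ordinary RDD of case classes
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

// Query it with SQL; the result is again a SchemaRDD
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
```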
DataSource API in Spark 1.3
● The first unified API to read from structured and semi-structured sources
● Can read from RDBMSs and NoSQL databases like MongoDB, Cassandra etc.
● An advanced API, like InputFormat, which gives the source a lot of control to optimize data locality
● So in Spark 1.3, Spark addressed the need for structured data to be first class in the big data ecosystem
● For more info refer to the Anatomy of DataSource API talk [2]
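A typical use of the DataSource API, sketched with the reader interface that stabilised in later 1.x releases; the URL, table, and credentials below are placeholders:

```scala
// Unified reader: the format name selects the data source implementation.
// All option values below are illustrative placeholders.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/sales")
  .option("dbtable", "orders")
  .option("user", "reader")
  .load()

// The same unified API reads JSON, Parquet, Cassandra, MongoDB, etc.
val jsonDF = sqlContext.read.format("json").load("events.json")
```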
DataFrame Abstraction in Spark
● Spark understood that modifying the RDD abstraction was not good enough
● Many frameworks like Hive and Pig tried, and failed, to map querying efficiently onto Map/Reduce
● So Spark came up with the DataFrame abstraction, which goes through a completely different, highly optimized pipeline than that of RDDs
● For more info refer to the Anatomy of DataFrame API talk [3]
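To see why the DataFrame pipeline can be optimized, consider the declarative DSL sketch below; `df` and the column names are illustrative assumptions:

```scala
import org.apache.spark.sql.functions._

// The query is built as an expression tree, not as opaque closures,
// so the optimizer can reorder filters, prune columns, push predicates
// down to the source, and generate efficient code.
val report = df
  .filter(col("age") > 21)
  .groupBy(col("dept"))
  .agg(avg(col("salary")).as("avg_salary"))
```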
In-Memory in Spark 1.0
● Spark was the first open source big data framework to embrace in-memory computing
● Cheaper hardware and abstractions like RDD allowed Spark to exploit memory more efficiently than all other Hadoop ecosystem projects
● The first implementation of in-memory computing followed the typical cache approach of keeping serialized Java bytes
● This proved to be challenging later
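The serialized caching described above is exposed through storage levels; a minimal sketch, assuming an existing RDD named `rdd`:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the RDD as serialized Java bytes in memory.
// More space-efficient than deserialized objects, but every access
// pays a deserialization cost.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// rdd.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY),
// which keeps deserialized objects instead.
```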
Challenges of In-Memory in Java
● As more and more big data frameworks started to exploit memory, they soon realised a few limitations of the Java memory model
● Java memory management is tuned for short-lived objects, and complete control of memory is given to the JVM
● But as big data systems started using the JVM for long-term storage, the JVM memory model started to feel inadequate
● Also, as the Java heap grew to cache more data, GC pauses started to kill performance
Custom Memory Management
● Apache Flink was the first big data system to implement custom memory management in Java
● Flink follows a DataFrame-like API with a custom memory model
● The custom memory model, with its non-GC-based approach, proved to be highly successful
● Observing these trends in the community, Spark adopted the same approach in Spark 1.4
Tungsten in Spark 1.4
● Spark released the first version of custom memory management in version 1.4
● It supported only DataFrames, as they are amenable to the custom memory model
● Custom memory management greatly improved the use of Spark with larger VM sizes and fewer GC pauses
● It solved OOM issues which plagued earlier versions of Spark
● For more info refer to the Anatomy of In-Memory Management in Spark talk [4]
RDD and Map/Reduce APIs
● The RDD API of Spark follows the functional programming paradigm, which is similar to Map/Reduce
● The RDD API passes around opaque function objects, which is great for programming but bad for system-level optimization
● The Map/Reduce API of Java follows the same pattern, but is less elegant than the Scala one
● Hard to optimise compared to Pig/Hive
● So we saw a steady increase in custom DSLs in the Hadoop world
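The opacity problem can be seen side by side in this sketch; Spark can run, but not inspect, the closure in the first case, while the second builds an analyzable expression tree (`people`, `peopleDF`, and the column name are illustrative):

```scala
import org.apache.spark.sql.functions._

// RDD API: `p => p.age > 21` is compiled JVM bytecode. Spark must
// treat it as a black box, so no predicate pushdown is possible.
val adultsRdd = people.filter(p => p.age > 21)

// DataFrame DSL: the predicate is data (an expression tree), so the
// optimizer can inspect it and push it down to the source.
val adultsDF = peopleDF.filter(col("age") > 21)
```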
Need for DSLs in Hadoop
● DSLs like Pig or Hive are much easier to understand compared to the Java API
● Less error-prone, and they help you be very specific
● They can be easily optimised, as a DSL focuses only on what to do, not how to do it
● Since Java Map/Reduce mixes the what with the how, it is hard to optimize compared to Hive and Pig
● So more and more people preferred these DSLs over platform-level APIs
Challenges of DSLs in Hadoop
● The Hive and Pig DSLs do not integrate well with the Map/Reduce APIs
● DSLs often lack the flexibility of a complete programming language
● The Hive and Pig DSLs don't share a single abstraction, so you cannot mix them
● DSLs are powerful for optimization, but soon become limited in terms of functionality
Scala as a Language to Host DSLs
● Scala is one of the first languages to embrace DSLs as first-class citizens
● Scala features like implicits, higher-order functions, structural types etc. make it easy to build DSLs and integrate them with the language
● This allows any Scala library to define a DSL and harness the full power of the language
● Many libraries outside big data define their own DSLs. Ex: Slick, Akka HTTP, sbt
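A toy example of how implicits let a Scala library host a DSL; everything here is an illustrative invention, unrelated to Spark's actual internals:

```scala
object MiniDsl {
  // A tiny expression tree: the DSL builds data instead of evaluating
  // immediately, so a host system could inspect and optimize it.
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class Gt(left: Expr, right: Expr) extends Expr

  // The implicit class pins DSL methods onto plain strings.
  implicit class ColumnOps(name: String) {
    def isAbove(value: Int): Expr = Gt(Col(name), Lit(value))
  }

  def main(args: Array[String]): Unit = {
    val predicate: Expr = "age" isAbove 21 // reads like a DSL, builds a tree
    println(predicate)                     // prints: Gt(Col(age),Lit(21))
  }
}
```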
DataFrame DSL and Spark SQL DSL
● To harness the power of custom memory management and Hive-like optimizers, Spark encourages writing DataFrame and Spark SQL DSL code over Spark RDD code
● Whenever we write this DSL, all the features of the Scala language and its libraries are available, which makes it more powerful than Pig/Hive
● Other frameworks like Flink and Beam follow the same ideas on Scala, Java 8 etc.
● You can easily mix and match the DSL with the RDD API
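Mixing the DSL with the RDD API is a one-liner in either direction, as this sketch shows; the names `salesDF` and `ordersRdd` are illustrative:

```scala
import org.apache.spark.sql.functions._

// DataFrame -> RDD: drop to the low-level API when needed
val rowRdd = salesDF.filter(col("amount") > 100).rdd

// RDD -> DataFrame: lift back into the optimized world
// (assumes `import sqlContext.implicits._` and an RDD of case classes)
val df = ordersRdd.toDF()
```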
Dataset DSL in Spark 1.6
● The DataFrame DSL was introduced in 1.4 and stabilised in 1.5
● As Spark observed the usability and performance benefits of DSL-based programming, it wanted to make it an important pillar of Spark
● So in Spark 1.6, Spark released the Dataset DSL, which is poised to replace the RDD API in user land
● This indicates a big shift in thinking, as we move further and further away from the 1.0 Map/Reduce and unstructured-data mindset.
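The Dataset DSL in 1.6 combined typed, RDD-like code with the optimized pipeline; a sketch, where the case class and path are illustrative:

```scala
case class Person(name: String, age: Long)

import sqlContext.implicits._  // brings the needed Encoders into scope

// A typed view over structured data: the code looks like RDD code,
// but executes through the optimized DataFrame pipeline.
val people = sqlContext.read.json("people.json").as[Person]
val adults = people.filter(_.age > 21).map(_.name)
```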
Evolution of Libraries vs Frameworks
● Spark is one of the first big data projects to build a platform rather than a collection of frameworks
● A single abstraction results in multiple libraries, not multiple frameworks
● All these libraries benefit from improvements in the runtime
● This allowed Spark to build a large ecosystem in very little time
● To understand the meaning of platform, refer to the Introduction to Flink talk [5]
Data Exchange Between Libraries
● As more and more libraries were added to Spark, having a common way to exchange data became important
● Initially, libraries used RDD as the data exchange format, but soon discovered some limitations
● Limitations of RDD as a data exchange format:
○ No defined schema; each library needs to come up with its own domain objects
○ Too low level
○ Custom serialization is hard to integrate
DataFrame as the Data Exchange Format
● Over the last few releases, Spark has been making DataFrame the new data exchange format of Spark
● A DataFrame has a schema and can be easily passed around between libraries
● DataFrame is a higher-level abstraction compared to RDD
● As DataFrames are serialized using platform-level code generation, all libraries follow the same serialization
● Dataset will carry the same advantages
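The exchange works because every library consumes and produces DataFrames; for example, a spark.ml stage, sketched with an assumed input DataFrame `docsDF` and illustrative column names:

```scala
import org.apache.spark.ml.feature.Tokenizer

// spark.ml stages take a DataFrame and return a new DataFrame,
// so the output can flow straight into Spark SQL, other ML stages,
// or any library that speaks DataFrame.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val tokenized = tokenizer.transform(docsDF)  // docsDF has a "text" column
```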
Learnings from Spark 1.x
● Structured/semi-structured data is the first-class citizen of big data processing systems
● Custom memory management and code-generated serialization give the best performance on the JVM
● DataFrame/Dataset are the new abstraction layers on which to build next-generation big data processing systems
● DSLs are the way forward over Map/Reduce-like APIs
● Having high-level structured abstractions makes libraries coexist happily on a platform
References
1. http://spark.apache.org/news/spark-1-0-0-released.html
2. https://www.youtube.com/watch?v=ckX6fT3kYG0
3. https://www.youtube.com/watch?v=iKOGBr-kOks
4. https://www.youtube.com/watch?v=7nIMpD5TyNs
5. https://www.youtube.com/watch?v=jErEhxP8LYQ