Spark vs Hadoop

Apache Spark Data Analytics. Comparison to the Existing Technology at the Example of Apache Hadoop MapReduce. Final Presentation Seminar: „Data Science in the Era of Big Data“ Olesya Eidam Technische Universität München 13.08.2015


Page 1: Spark vs Hadoop

Apache Spark Data Analytics. Comparison to the Existing Technology at the Example of Apache Hadoop MapReduce.

Final Presentation

Seminar: „Data Science in the Era of Big Data“

Olesya Eidam

Technische Universität München

13.08.2015

Page 2: Spark vs Hadoop

Introduction

A brief introduction to the existing big data analytics tools

Page 3: Spark vs Hadoop

Source: [1]

The World of Big Data

Apache Hadoop and Spark within the context of big data analytics:

Page 4: Spark vs Hadoop

Outline

1. Introduction

2. Hadoop

3. Spark

4. Spark vs. Hadoop MapReduce

5. Spark + HDFS

6. Machine Learning: K-Means

Page 5: Spark vs Hadoop

Apache Hadoop

The framework for handling big data based on several interlocking technologies

Page 6: Spark vs Hadoop

What is Hadoop?

The Hadoop project’s open-source software for reliable, scalable, distributed computing

Source: [7], [8]

Page 7: Spark vs Hadoop

HDFS and YARN Architecture

A Hadoop cluster is characterized by a master-slave architecture, which utilizes the “shared-nothing” principle for effective data processing.

Source: [11]

Page 8: Spark vs Hadoop

MapReduce: an example

MapReduce means breaking the processing into two phases: the map phase and the reduce phase, both performed in a distributed, parallel way on a cluster of computers.

Source: [11]
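The two phases can be illustrated with a minimal single-process sketch in plain Python. It is not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and the word-count task is chosen here only because it is the customary MapReduce example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # On a cluster each line (or block) would be mapped on a different node;
    # here the map calls simply run one after another.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, three "records"
lines = ["spark and hadoop", "hadoop mapreduce", "spark"]
counts = word_count(lines)
print(counts)
```

In a real cluster the map calls and the reduce calls would each run in parallel on different machines, with the grouping step moving data between them.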

Page 9: Spark vs Hadoop

MapReduce within Hadoop Framework

…represents a scalable solution, which can be extended to several reduce tasks…

Source: [18]

Page 10: Spark vs Hadoop

Limitations of Hadoop MapReduce

…however, not necessarily a universally suitable solution, especially for tasks of growing importance.

Source: [2]

Page 11: Spark vs Hadoop

Shuffle and Sort

Slow due to replication, serialization and I/O. Inefficient for iterative algorithms and interactive data mining:

Source: [4]

Page 12: Spark vs Hadoop

Apache Spark

An open-source project for fast, in-memory and large-scale data processing

Page 13: Spark vs Hadoop

What is Spark?

“Effective, fast, general-purpose cluster computing framework with high level APIs in Java, Scala, Python and R”:

Source: [9]

Page 14: Spark vs Hadoop

Spark’s buildup

In addition to the benefits of HDFS, Spark relies on the DAG* pattern for complex, multi-step data pipelines and on in-memory data sharing across the DAG.

Source: [12] *DAG: Directed Acyclic Graph

Page 15: Spark vs Hadoop

Anatomy of RDD

Distributed collections of objects that can be cached in memory across cluster nodes.

Source: [5] *RDD: Resilient Distributed Datasets

Some RDD characteristics:

• immutable
• resilient
• distributed
• lazily evaluated
• cacheable/persistent
• fault-tolerant

Page 16: Spark vs Hadoop

Actions and Transformations

Spark enables lazy evaluation through a dependency chain of RDDs. The DAG allows more complex operations to be run consistently.

Source: [14], [8]

Transformations
• Return pointers to new RDDs
• Are lazy (not computed immediately)
• Transformed RDDs get recomputed when actions are run on them
• RDDs can be persisted in memory or on disk

Actions
• Return values
• Result in a DAG of operations
• The DAG is compiled into stages, where each stage is executed as a series of tasks
• Tasks: fundamental units of work
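The difference between lazy transformations and actions can be sketched with a toy class in plain Python. This is not the Spark API: LazyDataset, its methods and the calls counter are hypothetical names introduced only to show that transformations merely record a dependency chain, while an action triggers the computation (and, without persistence, triggers it again on every action).

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # base data, never modified (immutability)
        self._lineage = lineage  # recorded chain of pending transformations

    # --- transformations: return a new dataset, compute nothing ---
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        """Replay the recorded lineage over the base data."""
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # --- actions: run the lineage and return a value ---
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every invocation of square, to make laziness visible


def square(x):
    calls.append(x)
    return x * x


base = LazyDataset(range(1, 6))
even_squares = base.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has been computed yet
result = even_squares.collect()   # the action runs the whole chain
calls_after_action = len(calls)   # square has now run once per element
print(calls_before_action, result, calls_after_action)
```

Calling a second action such as even_squares.count() replays the lineage from the base data, which is the recomputation that persisting an RDD in memory or on disk is meant to avoid.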

Page 17: Spark vs Hadoop

MapReduce vs Spark

Comparison to Hadoop MapReduce

Page 18: Spark vs Hadoop

The Map Side

Spark does not merge or partition spill files. The output of the map phase is written to the OS buffer cache, and each map task outputs as many spill files as there are reducers.

Source: [6]
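The statement that each map task writes one file per reducer can be made concrete with a small plain-Python sketch. It only models the bookkeeping, not Spark's shuffle implementation: the CRC32-based partitioner, the function names and the sample records are assumptions chosen for illustration, and in-memory lists stand in for spill files.

```python
from zlib import crc32


def partition(key, num_reducers):
    """Deterministic hash partitioner: pick the reducer responsible for a key."""
    return crc32(key.encode("utf-8")) % num_reducers


def run_map_task(records, num_reducers):
    """One map task: split its output into one bucket ("file") per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


num_reducers = 3
# Hypothetical input: the records handled by two separate map tasks
map_inputs = [
    [("spark", 1), ("hadoop", 1)],
    [("hadoop", 1), ("hdfs", 1)],
]
map_outputs = [run_map_task(records, num_reducers) for records in map_inputs]

# Every map task produces num_reducers buckets, so the total number of
# intermediate files is (number of map tasks) x (number of reducers).
total_files = sum(len(buckets) for buckets in map_outputs)
print(total_files)
```

Because the partitioner is deterministic, all records with the same key end up in the bucket of the same reducer, regardless of which map task produced them.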

Hadoop MapReduce vs Spark

Page 19: Spark vs Hadoop

The Reduce Side

The map phase pushes the data to the reducers in the form of intermediate (shuffle) files. These files are written to the reducer’s memory, and the reduce functionality is invoked.

Source: [6]

Hadoop MapReduce vs Spark

Page 20: Spark vs Hadoop

Better for Iterative Computations

Data sharing in Hadoop is slow due to replication, serialization and disk I/O.

Source: [16]

Hadoop MapReduce vs Spark

Page 21: Spark vs Hadoop

Better for Interactive Computations

For the same reason, Hadoop underperforms for interactive (low-latency) computations.

Source: [16]

Hadoop MapReduce vs Spark

Page 22: Spark vs Hadoop

Spark on HDFS

Can Spark replace Hadoop?

Page 23: Spark vs Hadoop

The combination of Hadoop and Spark

Operational applications augmented by in-memory performance:

Source: [14]

Hadoop features

Spark features

Page 24: Spark vs Hadoop

K-Means

Use case in machine learning: iterative algorithm for clustering data

Page 25: Spark vs Hadoop

The Algorithm

K-Means works by forming clusters of data points by minimizing the sum of squared distances between the data points and their centroids.

Source: [6]
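A minimal single-machine sketch of this iteration in plain Python may help. It is neither the Spark nor the Hadoop implementation compared in [6]: it uses the standard assign-then-update loop (Lloyd's algorithm), and the function names, sample points and initial centroids are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its centroid
            continue
        new_centroids.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, centroids, max_iterations=100):
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        # Every iteration scans the full data set again.
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    return centroids, assign(points, centroids)


def sum_of_squared_distances(points, centroids, labels):
    """The objective that K-Means minimizes."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))


# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]
centroids, labels = kmeans(points, initial)
print(labels, centroids, sum_of_squared_distances(points, centroids, labels))
```

The loop rereads all points in every iteration, which is why an engine that can keep the data set in memory between iterations is attractive for this kind of algorithm.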

Page 26: Spark vs Hadoop

A short comparison:

~227 Lines of Code

~64 Lines of Code

Page 27: Spark vs Hadoop

Results by S. Gopalani, R. Arora

The results clearly showed that Spark performs considerably better in terms of time.

Source: [6]

Experimental Environment

Data sizes: 64 MB and 1240 MB on a single node, and 1240 MB on two nodes

Performance was monitored in terms of the time taken for clustering as per the requirements

The machines used were configured as follows:
• 4 GB RAM
• Linux Ubuntu
• 500 GB hard drive

Page 28: Spark vs Hadoop

Results by M. Zaharia et al.

Spark outperforms Hadoop by up to 20x in iterative machine learning and graph applications.

Source: [13]

Page 29: Spark vs Hadoop

Source: [1]

High Performance Computing…

Apache Hadoop and Spark within the context of big data analytics:

Page 30: Spark vs Hadoop

MPI and HARP Performance

HPC* tools perform better than Hadoop and Spark, but performance can be boosted using a hybrid approach with other technologies that blend HPC and big data, including Spark and HARP.

Source: [17] *HPC: High Performance Computing

Page 31: Spark vs Hadoop

Thank you for your attention!

...any questions?

Page 32: Spark vs Hadoop

Literature

Resources used for this presentation

Page 33: Spark vs Hadoop


[1] B. Zhang. A Brief Introduction of Existing Big Data Tools - A Presentation, Retrieved August 2015, URL: http://scholarwiki.indiana.edu/Z604/slides/big%20data%20tools%20v2.pdf

[2] G. Fox. Multi-faceted Classification of Big Data Uses and Proposed Architecture Integrating High Performance Computing and the Apache Stack – A Presentation for the Sixth International Workshop on Cloud Data Management, Cloud DB 2014, Chicago, March 2014.

[3] S. Jha, J. Qiu, A. Luckow, P. Mantha, G. C. Fox. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. Big Data (BigData Congress), 2014 IEEE International Congress on. IEEE, 2014.

[4] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2010.

[5] T. Duarte. Anatomy of RDD - An Explanatory Video Illustration, Retrieved in June 2015. URL: http://www.sparkinternals.com/

Page 34: Spark vs Hadoop


[6] S. Gopalani, R. Arora. Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means. International Journal of Computer Applications (0975 - 8887), 113(1), March 2015.

[7] Apache, Inc. Apache™ Hadoop® Documentation, Retrieved in July 2015. URL: http://www.apache.org/

[8] Hortonworks, Inc. Hortonworks Data Platform: Getting Started Guide – A Whitepaper, May 2014

[9] Apache, Inc. Apache™ Spark Documentation, Retrieved in July 2015. URL: http://www.apache.org/

[10] A. Murthy, Hortonworks, Inc. Apache Hadoop 2 is now GA! – A Blog Entry, October 2013, Retrieved August 2015. URL: http://hortonworks.com/blog/apache-hadoop-2-is-ga/

Page 35: Spark vs Hadoop

[11] Edureka!. Apache Hadoop 2.0 and YARN – Instruction, October 2013, Retrieved in August 2015, URL: http://www.edureka.co/blog/apache-hadoop-2-0-and-yarn/

[12] V. Shukla, R. Venkatesh. Hortonworks, Inc. Spark Webinar Presentation, October 2014

[13] M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. University of California, Berkeley, 2012.

[14] MC Srivas, MapR Technologies, Inc. Why Spark on Hadoop Matters – A Presentation, July 2014.

[15] Y. Wang, R. Goldstone, W. Yu, T. Wang. Characterization and Optimization of Memory-Resident MapReduce on HPC Systems. 2014 IEEE 28th International Parallel & Distributed Processing Symposium.

Page 36: Spark vs Hadoop

[16] Databricks, Inc. Intro to Apache Spark – A Workshop Presentation, Retrieved in August 2015. URL: http://training.databricks.com/workshop/itas_workshop.pdf

[17] S. Jha, J. Qiu, A. Luckow, P. Mantha, G. C. Fox. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. Big Data (BigData Congress), 2014 IEEE International Congress on (pp. 645-652). IEEE, June 2014.

[18] IBM, Inc. What is MapReduce? – An Explanatory Article, Retrieved in August 2015. URL: http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/