Hadoop Vs Spark — Choosing the Right Big Data Framework




We are surrounded by data from all sides. It is estimated that by 2020, the digital universe will be as large as 44 zettabytes, roughly as many digital bits as there are stars in the universe.


The data keeps growing, and we are not getting rid of it any time soon. To digest it all, a growing number of distributed systems have appeared on the market. The most famous rivalry among them is Hadoop vs Spark, two frameworks that are often pitted against one another as direct competitors.

When deciding which of these two frameworks is right for you, it is important to compare them on a few essential parameters. In this blog, we shed some light on those parameters.


1. Performance

Spark is lightning-fast and generally outperforms the Hadoop framework: it runs up to 100 times faster in memory and 10 times faster on disk. It has also been shown to sort 100 TB of data three times faster than Hadoop while using ten times fewer machines. Spark owes this speed to processing everything in memory, which makes it particularly fast on iterative machine learning workloads such as Naive Bayes and k-means. Thanks to that in-memory processing, Spark can deliver real-time analytics on data from marketing campaigns, IoT sensors, machine learning pipelines, and social media sites.
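To make the in-memory point concrete, here is a minimal sketch of an iterative MLlib job in Scala, Spark's native language. The HDFS path and column names are invented for illustration; the key detail is that cache() keeps the data in memory, so each k-means iteration avoids re-reading it from disk.

```scala
// A minimal sketch of Spark's in-memory processing driving an iterative
// MLlib algorithm (k-means). The HDFS path and column names are invented.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder().appName("kmeans-sketch").getOrCreate()

// Load sensor readings (placeholder path) and keep them cached in memory,
// so each k-means iteration avoids another round of disk I/O.
val readings = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/sensor_readings.csv")
  .cache()

// MLlib expects the numeric inputs assembled into a single feature vector.
val features = new VectorAssembler()
  .setInputCols(Array("temperature", "humidity"))
  .setOutputCol("features")
  .transform(readings)

// The iterative clustering runs against the cached, in-memory data.
val model = new KMeans().setK(3).setSeed(42L).fit(features)
model.clusterCenters.foreach(println)
```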

However, if Spark is running on YARN alongside other shared services, its performance may degrade and lead to RAM overhead issues. In this particular scenario, Hadoop emerges as the real hero: for workloads that lean towards batch processing, Hadoop is much more efficient than its counterpart. A sketch of the memory settings typically tuned in this situation follows below.
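The sketch below shows the kind of executor memory settings teams usually adjust when Spark has to share a YARN cluster; the values are illustrative assumptions, not recommendations from this article.

```scala
// A sketch of the memory settings usually tuned when Spark shares a YARN
// cluster; the values are illustrative assumptions, not recommendations.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("yarn-tuning-sketch")
  .config("spark.executor.memory", "4g")              // JVM heap per executor
  .config("spark.executor.memoryOverhead", "1g")      // off-heap room YARN accounts for
  .config("spark.dynamicAllocation.enabled", "true")  // hand idle executors back to YARN
                                                      // (requires the shuffle service)
  .getOrCreate()
```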

Hadoop is a big data framework that was never built for lightning speed; it relies on batch processing. Its original aim was to continuously gather information from websites, with no requirement to process that data in or near real time.

Bottom Line: Hadoop and Spark process data in different ways. Whether to go ahead with Hadoop or Spark in the performance battle therefore depends entirely on the requirements of the project.

Facebook and its Transitional Journey with the Spark Framework

Data on Facebook grows with each passing second; in fact, it is growing even as you read this blog. To handle this data and visualize it in order to make intelligent decisions, Facebook relies on analytics, using a number of platforms:

- the Hive platform to execute some of Facebook's batch analytics
- the Corona platform for its custom MapReduce implementation
- the Presto footprint for ANSI-SQL-based queries


The Hive platform discussed above was computationally resource-intensive, so maintaining it was a huge challenge. Facebook therefore decided to switch step by step to the Apache Spark framework to manage its data. Today, Facebook has deployed a faster, more manageable pipeline for its entity ranking systems by integrating Spark.

2. Security

Spark's security is still in an emerging stage, supporting authentication only via a shared secret (password authentication). Even Apache Spark's official website acknowledges that "There are many different types of security concerns. Spark does not necessarily protect against all things."
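As a rough illustration, that shared-secret authentication is switched on through configuration properties. The secret value below is a placeholder; on YARN or Kubernetes the cluster manager normally generates and distributes it instead.

```scala
// A sketch of Spark's shared-secret authentication settings; the secret is a
// placeholder, not something to hard-code in a real deployment.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.authenticate", "true")              // require the shared secret
  .set("spark.authenticate.secret", "change-me")  // placeholder secret
  .set("spark.network.crypto.enabled", "true")    // also encrypt RPC traffic

val spark = SparkSession.builder().appName("auth-sketch").config(conf).getOrCreate()
```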


Hadoop, on the other hand, has better security features than Spark. Its security benefits (Hadoop authentication, authorization, auditing, and encryption) integrate effectively with Hadoop security projects such as Knox Gateway and Sentry.

Bottom Line: In the Hadoop vs Spark security battle, Spark is a little less secure than Hadoop. However, when Spark is integrated with Hadoop, it can use Hadoop's security features.

3. Cost

First of all, both Hadoop and Spark are open-source frameworks and thus come for free. Both use commodity servers, run in the cloud, and have somewhat similar hardware requirements.


So how should we evaluate them on the basis of cost?

Note that Spark uses huge amounts of RAM to run everything in memory, and RAM carries a higher price tag than hard disks. Hadoop, on the other hand, is disk-bound, which saves the cost of buying expensive RAM; however, it needs more machines over which to distribute its disk I/O.

Therefore, when comparing the Spark and Hadoop frameworks on cost, organizations have to weigh their requirements. If the requirement leans towards processing large amounts of historical big data, Hadoop is the way to go, because hard-disk space comes at a much cheaper price than memory. Conversely, Spark can be cost-effective for real-time data, since it uses less hardware to perform the same tasks at a much faster rate.

Bottom Line: In the Hadoop vs Spark cost battle, Hadoop definitely costs less, but Spark is cost-effective when an organization deals with smaller volumes of real-time data.


4. Ease of Use

One of the biggest USPs of the Spark framework is its ease of use. Spark has user-friendly, comfortable APIs for its native language Scala, as well as for Java, Python, and Spark SQL (which grew out of the earlier Shark project).

Spark's simple building blocks make it easy to write user-defined functions. Moreover, since Spark supports both batch processing and machine learning, it simplifies the infrastructure for data processing. It even includes an interactive mode for running commands with immediate feedback, as the sketch below illustrates.
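The following small sketch defines and applies a user-defined function in Scala; the column names and sample rows are invented. The same lines can be pasted into spark-shell, Spark's interactive mode.

```scala
// A small sketch of a Spark user-defined function; the column names and
// sample rows are invented for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()
import spark.implicits._

// Normalise a country code to upper case, guarding against nulls.
val toUpper = udf((code: String) => if (code == null) null else code.toUpperCase)

val users = Seq(("alice", "us"), ("bob", "in")).toDF("name", "country")
users.withColumn("country", toUpper($"country")).show()
```

Because the DataFrame API is the same in batch jobs and in the shell, the same function works unchanged in both.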

On the other hand, Hadoop is written in Java and has a reputation for being difficult to program, with no interactive mode. Although Pig (an add-on tool) makes programming easier, it takes some time to learn the syntax.

Bottom Line: In the 'Ease of Use' battle, both Hadoop and Spark have their own ways of making themselves user-friendly. However, if we have to choose one, Spark is easier to program and also includes an interactive mode.

Is it Possible for Apache Hadoop and Spark to Have a Synergic Relationship?

Yes, it is very much possible, and we recommend it too. Let's get into the details of how they can work in tandem.

The Apache Hadoop ecosystem includes HDFS, Apache Query, and Hive. Let's see how Apache Spark can make use of them.

An Amalgamation of Apache Spark and HDFS

The purpose of Apache Spark is to process data. However, in order to process data, the engine needs input from storage, and for this purpose Spark uses HDFS (not the only option, but the most popular one, since Apache is the brain behind both of them).
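Here is a minimal sketch of that division of labour, assuming a placeholder namenode address and file path:

```scala
// A minimal sketch of Spark taking its input from HDFS; the namenode address
// and path are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-input-sketch").getOrCreate()

// HDFS supplies the storage; Spark supplies the processing engine.
val logs = spark.read.textFile("hdfs://namenode:8020/data/web/access.log")
println(s"Lines read from HDFS: ${logs.count()}")
```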


A Blend of Apache Hive and Apache Spark

Apache Spark and Apache Hive are highly compatible, and together they can solve many business problems.

For instance, suppose a business is analyzing consumer behavior. The company will need to gather data from various sources such as social media, comments, clickstream data, customer mobile apps, and many more. An intelligent move for the organization is to use HDFS to store the data and Apache Hive as a bridge between HDFS and Spark, as sketched below.
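A hedged sketch of that bridge follows, with an invented database and table name standing in for the clickstream data registered in Hive:

```scala
// A sketch of Hive acting as the bridge between files in HDFS and Spark;
// the database and table names are invented for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-bridge-sketch")
  .enableHiveSupport()   // lets Spark use Hive's metastore and tables
  .getOrCreate()

// The clickstream files live in HDFS, but Hive catalogues them as a table,
// so Spark can reach them with plain SQL.
val topPages = spark.sql(
  """SELECT page, COUNT(*) AS views
    |FROM analytics.clickstream
    |GROUP BY page
    |ORDER BY views DESC
    |LIMIT 10""".stripMargin)
topPages.show()
```

Because Hive keeps the table metadata while HDFS keeps the files, Spark only needs SQL to reach data it never managed itself.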


Uber and its Amalgamated Approach

To process its consumers' big data, Uber uses a combination of Spark and Hadoop. It uses real-time traffic conditions to provide drivers at a particular time and location. To make this possible, Uber uses HDFS for uploading raw data into Hive, and Spark for processing billions of events.

Hadoop vs Spark: And the Winner Is

While Spark is faster than thunder and easy to use, Hadoop comes with robust security, mammoth storage capacity, and low-cost batch processing capabilities. Choosing one of the two depends entirely on the requirements of your project; the other alternative is to combine parts of Hadoop and Spark to give birth to an invincible combination.

Remember! “Between two evils, choose neither; between two goods, choose both.”—Tryon Edwards

Mix some attributes of Spark and some of Hadoop to come up with a brand new framework: Spoop.

Source - https://www.netsolutions.com/insights/hadoop-vs-spark/