[@NaukriEngineering] Apache Spark


Page 1: [@NaukriEngineering] Apache Spark

Apache Spark

Riya Singhal

Page 2:

Agenda

● What is Big Data?

● What is the solution to Big Data?

● How can Apache Spark help us?

● Apache Spark's advantages over Hadoop MapReduce

Page 3:

What is Big Data?

● Lots of data (terabytes or petabytes).

● Large and complex.

● Difficult to handle with relational databases.

● Challenges in searching, storing, transferring, analysing, and visualising.

● Requires parallel processing on hundreds of machines.

Page 4:

Hadoop MapReduce

● Allows distributed processing of large datasets across clusters.

● Open-source framework with scale-out storage and distributed processing.

● Characteristics:

○ Economical

○ Scalable

○ Reliable

○ Flexible

Page 5:

MapReduce

● Map - Data is converted into tuples (key/value pairs).

● Reduce - Takes the output of Map and combines those tuples into a smaller set of tuples.

● Advantages

○ Scales with data

○ Parallel processing

○ Fast

○ Built-in fault tolerance
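The Map and Reduce phases above can be sketched on a single machine with plain Python (the function names here are illustrative, not Hadoop's API; a real Hadoop job distributes these phases across a cluster):

```python
from itertools import groupby

def map_phase(words):
    # Map: emit a (key, value) tuple - (word, 1) - for every word.
    return [(word, 1) for word in words]

def reduce_phase(pairs):
    # Shuffle: group tuples by key; Reduce: sum the values per key.
    grouped = groupby(sorted(pairs), key=lambda kv: kv[0])
    return {key: sum(value for _, value in values) for key, values in grouped}

words = ['cat', 'dog', 'cat', 'mouse', 'cat']
counts = reduce_phase(map_phase(words))
print(counts)  # {'cat': 3, 'dog': 1, 'mouse': 1}
```

On a cluster, the sort-and-group step is the expensive shuffle: tuples with the same key must travel to the same machine before the reduce can run.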

Page 6:


Page 7:

Shortcomings of MapReduce

1. Slow for iterative jobs.

2. Slow for interactive ad-hoc queries.

3. Operations - forces every task to be either a Map or a Reduce.

4. Difficult to program - even simple join operations require extensive code.

Lacks efficient data sharing: data is shared through stable storage (HDFS), which is slow.

Replication and disk I/O make it slow, but they are essential for fault tolerance.

Can we use memory instead? How would that remain fault tolerant?

Page 8:

Apache Spark

● Developed in 2009 at UC Berkeley.

● Processing engine.

● Built for speed, ease of use, and sophisticated analytics.

● Based on the Hadoop MapReduce model, but extends it to support more types of computations.

● In the Daytona GraySort benchmark, Spark sorted 100 TB of data (1 trillion records) three times faster than Hadoop while using ten times fewer machines.

Page 9:

Apache Spark

● Improves efficiency through

○ In-memory data sharing.

○ A general computation graph.

● Improves usability through

○ Rich APIs in Java, Scala, and Python.

○ An interactive shell.

How?

Up to 100x faster in memory and 10x faster on disk.

Up to 2-5x less code.

Page 10:

Resilient Distributed Dataset (RDD)

● Fundamental data structure of Apache Spark.

● Read-only collection of objects partitioned across a set of machines.

● Supports in-memory computation.

● Built through transformation operations like map, filter, etc.

● Fault tolerant through lineage.

● Features:

○ Immutable

○ Parallel

○ Cacheable

○ Lazily evaluated

Page 11:

Resilient Distributed Dataset (RDD)

Two types of operations can be performed:

● Transformation

○ Creates a new RDD from an existing RDD.

○ Builds a DAG.

○ Lazily evaluated.

○ Increases efficiency by not materialising large intermediate datasets.

○ E.g. groupByKey, reduceByKey, filter.

● Action

○ Triggers execution of the queued transformations.

○ Performs the computation.

○ Returns the result to the driver program.

○ E.g. collect, count, take.
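The transformation/action split can be mimicked with Python generators, which are also lazily evaluated (a plain-Python analogy, not Spark's implementation):

```python
data = ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']

# "Transformations": generator expressions record a plan but do no work yet.
mapped = ((animal, 1) for animal in data)                  # like rdd.map(...)
filtered = (pair for pair in mapped if pair[0] != 'dog')   # like rdd.filter(...)

# Nothing has been computed so far - the generators are just a pipeline.

# "Action": list() forces evaluation of the whole pipeline, like collect().
result = list(filtered)
print(result)
# [('cat', 1), ('elephant', 1), ('cat', 1), ('mouse', 1), ('cat', 1)]
```

As in Spark, deferring work until the action runs lets the pipeline avoid building each intermediate collection in full.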

Page 12:
Page 13:
Page 14:
Page 15:

Ready for some programming…

(using Python)

Page 16:

Creating RDD

# Create a list of animals.
animals = ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']

# The parallelize method creates an RDD from a list.
# Here "animalRDD" is created; sc is the SparkContext object.
animalRDD = sc.parallelize(animals)

# Since RDDs are lazily evaluated, we use the collect() action
# to materialise the RDD's contents so we can print them.
print(animalRDD.collect())

Output - ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']

Page 17:

Creating RDD from file

# The file words.txt contains names of animals, from which animalsRDD is made.
animalsRDD = sc.textFile('/path/to/file/words.txt')

# collect() is the action operation.
print(animalsRDD.collect())

Page 18:

Map operation on RDD

'''
To count the frequency of each animal, we make a (key/value) pair -
(animal, 1) - for every animal, and then perform a reduce operation
that sums the values. lambda is used to write inline functions in Python.
'''
mapRDD = animalRDD.map(lambda x: (x, 1))

print(mapRDD.collect())

Output - [('cat',1), ('dog',1), ('elephant',1), ('cat',1), ('mouse',1), ('cat',1)]

Page 19:

Reduce operation on RDD

'''
reduceByKey performs the reduce operation per key. Its argument is a
function that adds the values for the same key, giving us the count
of each animal.
'''
reduceRDD = mapRDD.reduceByKey(lambda x, y: x + y)

print(reduceRDD.collect())

Output - [('cat',3), ('dog',1), ('elephant',1), ('mouse',1)]

Page 20:

Filter operation on RDD

'''
Keep only the animals from reduceRDD with a count greater than 2.
x is a tuple of (animal, count), i.e. x[0] is the animal name and
x[1] is its count, so we filter reduceRDD on x[1] > 2.
'''
filterRDD = reduceRDD.filter(lambda x: x[1] > 2)

print(filterRDD.collect())

Output - [('cat',3)]
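The whole map → reduceByKey → filter pipeline from the last three slides can be checked on a single machine with plain Python (no Spark needed); this is an analogy for the counting logic, not Spark code:

```python
from collections import Counter

animals = ['cat', 'dog', 'elephant', 'cat', 'mouse', 'cat']

# map + reduceByKey: emit (animal, 1) pairs and sum per key -
# Counter does both steps at once.
counts = Counter(animals)

# filter: keep only animals seen more than twice.
frequent = [(animal, n) for animal, n in counts.items() if n > 2]
print(frequent)  # [('cat', 3)]
```

The result matches the slides' filterRDD output; Spark's version differs in that each step runs in parallel across partitions.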

Page 21:

Please refer to http://spark.apache.org/docs/latest/programming-guide.html for more on programming in Apache Spark.

Page 22:

Page 23:
Page 24:

Spark vs. Hadoop

● Performance

○ Spark is better as it performs in-memory computation.

○ Hadoop is good for one-pass ETL jobs and for data that does not fit in memory.

● Ease of use

○ Spark is easier to program and provides APIs in Java, Scala, R, and Python.

○ Spark has an interactive mode.

○ Hadoop MapReduce is harder to program, but many tools are available to make it easier.

● Cost

○ Spark is cost effective according to benchmarks, though staffing can be costly.

● Compatibility

○ Compatibility with data types and data sources is the same for both.

Page 25:

Spark vs. Hadoop

● Data processing

○ Spark can perform both real-time processing and batch processing.

○ Hadoop MapReduce is good for batch processing; Hadoop requires Storm for real-time processing, Giraph for graph processing, and Mahout for machine learning.

● Fault tolerance

○ Hadoop MapReduce is slightly more fault tolerant.

● Caching

○ Spark can cache the input data in memory.

Page 26:

Applications

Companies that use Hadoop and Spark:

● Hadoop - good for static, batch-style operations.

○ Dell, IBM, Cloudera, AWS, and many more.

● Spark

○ Real-time marketing campaigns, online product recommendations, etc.

○ eBay, Amazon, Yahoo, Nokia, and many more.

○ Data mining 40x faster than Hadoop (Conviva).

○ Traffic prediction via EM (Mobile Millennium).

○ DNA sequence analysis (SNAP).

○ Twitter spam classification (Monarch).

Page 27:

Apache Spark is helping companies grow their business

● Spark helps Pinterest identify trends - using Spark, Pinterest can identify and react to developing trends as they happen.

● Netflix leans on Spark for personalisation - Netflix uses Spark to support real-time stream processing for online recommendations and data monitoring.

Page 28:

Libraries of Apache Spark

Spark provides libraries for generality; we can combine them seamlessly in the same application for more functionality.

Libraries provided by Apache Spark:

1. Spark Streaming - supports scalable, fault-tolerant processing of streaming data.

2. Spark SQL - allows Spark to work with structured data.

3. Spark MLlib - a scalable machine learning library with machine learning and statistical algorithms.

4. Spark GraphX - used for graph computation over data.

Refer to http://spark.apache.org/docs/latest/ for more information.

Page 29: