ACADGILD Webinar - Big Data Analytics with 'Spark': The New Poster boy of Big Data

presents

Spark

Presented by: Sandy

© copyright ACADGILD

Introduction to ACADGILD

• You can also click on this link to view the video –

https://www.youtube.com/watch?v=7nipSdxv2Uo

Webinar on Spark 2

https://www.youtube.com/watch?v=7nipSdxv2Uo


Introduction of Mentor

• The Mentor for this Webinar is Mr. Sandy and below are his qualifications:

• 15 years of experience in IT focusing on Big Data, Data Science and IoT solutions and implementations.

• Expert in the Apache SPARK Ecosystem including Spark 1.6, Scala, Spark SQL, Spark Streaming, MLLIB , SparkR and GraphX.

• Extensive experience in Hadoop Framework solutions including YARN,/MesosHDFS, MapReduce, PigLatin , Hive, HBase/MongoDB/Cassandra, Mahout, Flume, Zookeeper, Oozie and Sqoop.

• Knowledge of Machine Learning for both Supervised and Unsupervised Learning Algorithms.

Webinar on Spark 3


Agenda

4Webinar on Spark

Sl No. Agenda Title

1 What is Big data?

2 MapReduce Limitations

3 Introduction to Spark

4 Spark in Hadoop Ecosystem

5 Why In-memory Processing?

6 In-memory Caching

7 Resilient Distributed Dataset

8 Creating RDDs

9 Spark Unified Platform

10 Popular Use Cases

11 Apache Spark Case Studies

12 Get Your Feet Wet with Spark API's

4

© copyright ACADGILD5

What is Big data?

Webinar on Spark 5


MapReduce Limitations

6

• MapReduce is based on disk based computing.

• It is more suitable for single pass computations.

• It is not at all suitable for iterative computations.

• Disk intensive.

Programming Model limitations:

• Developing efficient MapReduce applications requires advanced programming skills and deep understanding of the system architecture.

• Every problem has to be broken down in to Map and Reduce phases.

Webinar on Spark 6


Introduction to Spark• Apache Spark is a fast and general-purpose cluster computing system.

• Spark is a framework for Scheduling, Monitoring and Distributing the applications.

• Spark is a General Unified Engine which can replace many specialized systems like Mahout, Tez, Graphlab, Storm, etc.

Webinar on Spark 7


SQL

GraphX

MLlib

Streaming

RDBMS

Distributions:

DatabasesFile systems

Streaming

sources

Resource Managers

Libraries

APIs

Spark in Hadoop Ecosystem

8Webinar on Spark

- CDH- HDP- Map R- DSE


Earlier Now

RAM was very costly. Comparatively, disk was cheap, so disk was primary source of data.

Cost of RAM has been sharply reduced with increase in performance. So, RAM is primary source of data and we use disk for fallback.

Network was costly so data locality Network is faster.

Single core machines were dominant Multi core machines are commonplace

Why In-memory processing?

• Drastic change in hardware

9Webinar on Spark


In-memory Caching

10Webinar on Spark


Resilient Distributed Dataset

• Resilient distributed dataset (RDD), represents an immutable collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

• It’s a distributed memory abstraction.

Features:

• Cache an RDD in memory across machines.

• Reuse in multiple MapReduce like parallel operations.

• Fault tolerant through lineage.

11Webinar on Spark


Creating RDDs

• Turn a collection into an RDD.

val a = sc.parallelize(Array(1, 2, 3))

• Load text file from local FS, HDFS, or S3.

val a = sc.textFile("file.txt")

val b = sc.textFile("directory/*.txt")

val c = sc.textFile("hdfs://namenode:9000/path/file")

12Webinar on Spark


Graph

Spark Core Engine

MLlib

Machine

Learning

Spark

Streaming

Streaming

Graphx

ComputationSpark R

R on SparkSpark SQL

DataFrame

Spark Unified Platform

13Webinar on Spark


29%

36%

40%

44%

52%

68%

Popular Use Cases

14

Business Intelligence

Data Warehousing

Recommendation

Log Processing

User-Facing Services

Fraud Detection/ Security

Webinar on Spark


Apache Spark Case Studies

Credit Card Fraud Detection

Network Security

Genomic Sequencing

Real-Time Ad Processing

15Webinar on Spark


Get Your Feet Wet with Spark API's

Quick tour of Scala, Python, Java API's

16Webinar on Spark


Any Questions?

Webinar on Spark 17

Contact Info:

oWebsite : http://www.acadgild.com

oLinkedIn : https://www.linkedin.com/company/acadgild

oFacebook : https://www.facebook.com/acadgild

oSupport: [email protected]


Get in Touch with Us

18Webinar on Spark 18

http://www.acadgild.com/

https://www.linkedin.com/company/acadgild

https://www.facebook.com/acadgild

mailto:[email protected]


Thank You

Webinar on Spark 19

ACADGILD Webinar - Big Data Analytics with 'Spark': The New Poster boy of Big Data

Technology

Transcript of ACADGILD Webinar - Big Data Analytics with 'Spark': The New Poster boy of Big Data