Apache Spark Introduction - CloudxLab

Transcript of Apache Spark Introduction - CloudxLab

Page 1: Apache Spark Introduction - CloudxLab

Welcome to

Hands-on Session on Big Data processing using Apache Spark

[email protected]

+1 419 665 3276 (US) +91 803 959 1464 (IN)

Page 2: Apache Spark Introduction - CloudxLab

Agenda

1. Apache Spark Introduction
2. CloudxLab Introduction
3. Introduction to RDD (Resilient Distributed Datasets)
4. Loading data into an RDD
5. RDD Operations: Transformations
6. RDD Operations: Actions
7. Hands-on demos using CloudxLab
8. Questions and Answers

Page 3: Apache Spark Introduction - CloudxLab

Hands-On: Objective

Compute the word frequency of a text file stored in HDFS (Hadoop Distributed File System)

Using Apache Spark

Page 4: Apache Spark Introduction - CloudxLab

Welcome to CloudxLab Session

• Learn Through Practice

• Real Environment

• Connect From Anywhere

• Connect From Any Device

A cloud-based lab for students to gain hands-on experience in Big Data technologies such as Hadoop and Spark

• Centralized Data sets

• No Installation

• No Compatibility Issues

• 24x7 Support

Page 5: Apache Spark Introduction - CloudxLab

About the Instructor

• 2015: CloudxLab. A big data platform.
• 2014: KnowBigData. Founded KnowBigData.
• 2012: Amazon. Built high-throughput systems for the Amazon.com site using in-house NoSQL.
• 2012: InMobi. Built a recommender that churns 200 TB.
• 2011: tBits Global. Founded tBits Global; built an enterprise-grade Document Management System.
• 2006: D.E. Shaw. Built big data systems before the term was coined.
• 2002: IIT Roorkee. Finished B.Tech.

Page 6: Apache Spark Introduction - CloudxLab

Apache Spark

A fast and general engine for large-scale data processing.

• Really fast MapReduce:
  • 100x faster than Hadoop MapReduce in memory
  • 10x faster on disk

• Builds on similar paradigms as MapReduce

• Integrated with Hadoop

Page 7: Apache Spark Introduction - CloudxLab

Spark Architecture

Spark Core runs on a cluster manager: Standalone, Amazon EC2, Hadoop YARN, or Apache Mesos.

Storage: HDFS, HBase, Hive, Tachyon, ...

Libraries: SQL, Streaming, MLlib, GraphX, SparkR

Languages: Java, Python, Scala

Page 8: Apache Spark Introduction - CloudxLab

Getting Started - Launching the console

Open CloudxLab.com to get login/password

Log in to the console

Or

• Download Spark from http://spark.apache.org/downloads.html

• Install Python

• (Optional) Install Hadoop

Run pyspark
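
The pyspark shell creates the SparkContext `sc` for you. If you run Spark from a standalone Python script instead, here is a minimal sketch of creating it yourself (the app name is an arbitrary assumption):

    from pyspark import SparkConf, SparkContext

    # Run locally on all cores; "WordFrequency" is just an illustrative app name
    conf = SparkConf().setMaster("local[*]").setAppName("WordFrequency")
    sc = SparkContext(conf=conf)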

Page 9: Apache Spark Introduction - CloudxLab

SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET

A collection of elements partitioned across the nodes of a cluster

• An RDD can be persisted in memory
• RDDs automatically recover from node failures
• Can hold any data type, with a special dataset type for key-value pairs
• Supports two types of operations: transformations and actions
• An RDD is read-only

Page 10: Apache Spark Introduction - CloudxLab

SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET

Convert an existing array into an RDD:

    myarray = sc.parallelize([1, 3, 5, 6, 19, 21])

Load a file from HDFS:

    lines = sc.textFile('/data/mr/wordcount/input/big.txt')

Check the first 10 lines:

    lines.take(10)  # an action: performs the actual loading and returns 10 lines

Page 11: Apache Spark Introduction - CloudxLab

SPARK - TRANSFORMATIONS

Transformations build a new RDD from an existing one and are evaluated lazily. persist() and cache() mark an RDD to be kept in memory once it has been computed, so later operations can reuse it.
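
A minimal sketch of caching, assuming the `lines` RDD from the earlier step:

    from pyspark import StorageLevel

    lines.cache()  # shorthand for persist(StorageLevel.MEMORY_ONLY)
    # lines.persist(StorageLevel.MEMORY_AND_DISK)  # alternative: spill to disk when memory is full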

Page 12: Apache Spark Introduction - CloudxLab

SPARK - TRANSFORMATIONS

map(func): Return a new distributed dataset formed by passing each element of the source through the function func. Analogous to FOREACH in Pig.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

See more: sample, union, intersection, distinct, reduceByKey, sortByKey, join
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html
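
A short sketch of these transformations on a toy RDD (the values are illustrative, not from the slides):

    nums = sc.parallelize([1, 2, 3, 4])
    nums.map(lambda x: x * 2).collect()            # [2, 4, 6, 8]
    nums.filter(lambda x: x % 2 == 0).collect()    # [2, 4]
    nums.flatMap(lambda x: [x, x * 10]).collect()  # [1, 10, 2, 20, 3, 30, 4, 40]

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    pairs.groupByKey().mapValues(list).collect()   # [('a', [1, 3]), ('b', [2])] (order may vary)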

Page 13: Apache Spark Introduction - CloudxLab

SPARK - Break the line into words

Define a function to convert a line into words:

    def toWords(mystr):
        wordsArr = mystr.split()
        return wordsArr

Execute the flatMap() transformation:

    words = lines.flatMap(toWords)

Check the first 10 words:

    words.take(10)  # performs the actual execution and returns 10 words

Page 14: Apache Spark Introduction - CloudxLab

SPARK - Cleaning the data

Define a function to clean a word and convert it to a key-value pair:

    import re

    def cleanKV(mystr):
        mystr = mystr.lower()
        mystr = re.sub("[^0-9a-z]", "", mystr)  # strip non-alphanumeric characters
        return (mystr, 1)  # return a tuple: (word, count)

Execute the map() transformation, passing cleanKV as the argument:

    cleanWordsKV = words.map(cleanKV)

Check the first 10 word pairs:

    cleanWordsKV.take(10)  # performs the actual execution and returns 10 (word, 1) pairs

Page 15: Apache Spark Introduction - CloudxLab

SPARK - ACTIONS

Actions return a value to the driver.

Page 16: Apache Spark Introduction - CloudxLab

SPARK - ACTIONS

reduce(func): Aggregate the elements of the dataset using the function func, which:
• takes two arguments and returns one
• must be commutative and associative so it can be computed in parallel

count(): Return the number of elements in the dataset.

collect(): Return all elements of the dataset as an array at the driver. Use only for small outputs.

take(n): Return an array with the first n elements of the dataset. Not parallel.

See more: first(), takeSample(), takeOrdered(), saveAsTextFile(path), reduceByKey()
https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD
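
A quick sketch of these actions on a toy RDD (illustrative values):

    nums = sc.parallelize([1, 2, 3, 4])
    nums.reduce(lambda x, y: x + y)  # 10
    nums.count()                     # 4
    nums.collect()                   # [1, 2, 3, 4]
    nums.take(2)                     # [1, 2]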

Page 17: Apache Spark Introduction - CloudxLab

SPARK - Compute the word count

Define an aggregation function:

    def sum(x, y):  # note: shadows Python's built-in sum()
        return x + y

Execute the reduceByKey() transformation:

    wordsCount = cleanWordsKV.reduceByKey(sum)

Check the first 10 counts:

    wordsCount.take(10)  # performs the actual execution and returns 10 (word, count) pairs

Save the result to HDFS:

    wordsCount.saveAsTextFile("mynewdirectory")

Page 18: Apache Spark Introduction - CloudxLab


Walkthrough with a sample input line:

INPUT: "After taking a shot with his bow, the archer took a bow."

Step 1: words = lines.flatMap(toWords)
Splits the line into words: After, taking, a, shot, with, his, bow,, the, archer, took, a, bow.

Step 2: cleanWordsKV = words.map(cleanKV)
Cleans each word and pairs it with a count of 1: (after,1) (taking,1) (a,1) (shot,1) (with,1) (his,1) (bow,1) (the,1) (archer,1) (took,1) (a,1) (bow,1)

Step 3: wordsCount = cleanWordsKV.reduceByKey(sum)
Merges the pairs key by key in parallel, e.g. (a,1) and (a,1) become (a,2), and (bow,1) and (bow,1) become (bow,2)

Step 4: wordsCount.saveAsTextFile(...) saves the result to an HDFS file.
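
Putting it all together, a minimal sketch of the complete job as a standalone pyspark script (paths and directory names as used in this session; assumes it is launched with spark-submit, which supplies the master):

    import re
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("WordFrequency"))

    def toWords(mystr):
        return mystr.split()

    def cleanKV(mystr):
        mystr = mystr.lower()
        mystr = re.sub("[^0-9a-z]", "", mystr)  # strip non-alphanumeric characters
        return (mystr, 1)

    lines = sc.textFile('/data/mr/wordcount/input/big.txt')
    wordsCount = lines.flatMap(toWords).map(cleanKV).reduceByKey(lambda x, y: x + y)
    wordsCount.saveAsTextFile("mynewdirectory")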

Page 19: Apache Spark Introduction - CloudxLab

Thank you.

+1 419 665 3276 (US) +91 803 959 1464 (IN) [email protected]

Subscribe to our YouTube channel for the latest videos - https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA