Spark introduction · RDD · Building and running Spark applications

Lightning-fast cluster computing

Timeline (figure): 2007 Hive, along with Pig; 2009 NoSQL; 2012 RDD concept paper published

The beginning of Spark

• Originator: Matei Zaharia
  • Started in 2009 as a class project in UC Berkeley's AMPLab
  • Need to do machine learning faster on HDFS
• Doctoral dissertation (2013)
  • http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
• Hear Matei talking
  • https://www.youtube.com/watch?v=BFtQrfQ2rn0

IBM

• 2015.6
• https://www-03.ibm.com/press/us/en/pressrelease/47107.wss

2015.9  

What is Spark?

• A general execution engine to improve/replace MapReduce
  • Spark's operators are a strict superset of MapReduce

What's wrong with the original MapReduce?

What's wrong with the original MapReduce?

• Limitations of MapReduce
  • Originated in the early 2000s (Google published the MapReduce paper in 2004). Old technology.
  • Designed for batch processing of large amounts of web pages at Google
    • And it does that job very well!
• Not fit for
  • Complex, multi-pass algorithms
  • Interactive ad-hoc queries
  • Real-time stream processing

We are asking too much from MapReduce

The Spark way!

Easier to develop on Spark

• Original MapReduce
  • Think of Assembly language
• Spark
  • Think of Python: print "Hello world!"

Word count

• MapReduce:
  • https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v2.0
• Spark:
  • https://spark.apache.org/examples.html

Spark is not just in-memory processing; it is faster on disk too!

A unified engine

Core Spark data abstraction

• Resilient Distributed Dataset (RDD)

RDDs

Creating RDD

RDD operations

RDD operations: Actions

RDD operations: Transformation

Example: map and filter
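
A minimal sketch of the idea, using plain Python built-ins as a stand-in for an RDD so that it runs without a Spark installation. The data and variable names are assumptions for illustration, not taken from the slides. On a real RDD the corresponding calls would be `rdd.map(...)` and `rdd.filter(...)`.

```python
# Hypothetical input: an ordinary list standing in for an RDD of text lines.
lines = ["spark is fast", "hadoop is older", "spark has rdds"]

# map: apply a function to every element (compare rdd.map(lambda line: line.upper())).
upper = map(str.upper, lines)

# filter: keep only the elements for which the predicate is true
# (compare rdd.filter(lambda line: line.startswith("SPARK"))).
spark_only = filter(lambda line: line.startswith("SPARK"), upper)

result = list(spark_only)
print(result)
```

Each step takes a collection and yields a new one; the input list itself is never modified, which mirrors how transformations produce a new RDD instead of changing the existing one.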

Lazy execution
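
In Spark, transformations are only recorded; nothing is computed until an action asks for a result. The sketch below imitates that behaviour with Python's lazy `map` and `filter` iterators. It is an analogy under stated assumptions (plain Python, hypothetical data and names), not Spark's own mechanism.

```python
calls = []  # records every time the mapping function actually runs


def traced_upper(line):
    """Upper-case a line and remember that we were called."""
    calls.append(line)
    return line.upper()


lines = ["a", "b", "c"]

# Building the pipeline runs nothing yet (like chaining transformations).
pipeline = map(traced_upper, lines)
pipeline = filter(lambda s: s != "B", pipeline)
calls_before = len(calls)

# Consuming the pipeline forces evaluation (like calling an action).
result = list(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The traced function is not called while the pipeline is being assembled, only when the final list is requested.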

Chaining transformations

RDD lineage and toDebugString

Functional programming in Spark

Passing functions as parameters

Passing named functions

Anonymous functions
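
A short sketch contrasting a named function with an anonymous one (a `lambda`) when either is passed as a parameter. Plain Python's `map` stands in for an RDD transformation here; the function and variable names are hypothetical.

```python
def to_upper(text):
    """Named function: defined once, then passed by name."""
    return text.upper()


words = ["spark", "rdd"]

# Passing a named function as a parameter.
named_result = list(map(to_upper, words))

# Passing an anonymous function defined inline.
anonymous_result = list(map(lambda text: text.upper(), words))

print(named_result, anonymous_result)
```

Both forms give the same result; a lambda is convenient for one-off, single-expression logic, while a named function is easier to reuse and test.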

Creating RDDs from collections

Creating RDDs from files (1)

Creating RDDs from files (2)

Whole file-based RDDs (1)

Whole file-based RDDs (2)

Some other RDD operations

Example: flatMap and distinct
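
A minimal plain-Python sketch of what these two operations do, with hypothetical input. `flatMap` maps each element to zero or more outputs and flattens the result into one collection; `distinct` removes duplicates. On an RDD the calls would be `rdd.flatMap(lambda line: line.split())` and `.distinct()`.

```python
from itertools import chain

# Hypothetical input lines standing in for an RDD of text.
lines = ["to be or", "not to be"]

# flatMap: each line yields several words; the per-line lists are flattened into one.
words = list(chain.from_iterable(line.split() for line in lines))

# distinct: drop duplicates. Sorting is only here to make the printed order deterministic.
distinct_words = sorted(set(words))

print(words)
print(distinct_words)
```

Note the contrast with `map`, which would return one list per line (a list of lists) instead of a single flat list of words.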

Example: multi-RDD transformations

Some other RDD operations

Conclusion

Aggregating data with pair RDDs

Pair RDDs

Creating pair RDDs

Example: a simple pair RDD

Example: keying by user ID
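
A plain-Python sketch of keying records by user ID, turning each record into a `(key, value)` pair. The log format, the position of the user ID, and the helper name `key_by` are assumptions for illustration; on an RDD the comparable call is `rdd.keyBy(func)`.

```python
def key_by(records, key_func):
    """Return (key, record) pairs, where the key is computed from each record."""
    return [(key_func(record), record) for record in records]


# Hypothetical log lines: the user ID is assumed to be the first field.
log_lines = [
    "u42 GET /index.html",
    "u7 GET /about.html",
    "u42 POST /login",
]

keyed = key_by(log_lines, lambda line: line.split()[0])

for user_id, line in keyed:
    print(user_id, "->", line)
```

Once every record carries its key, per-key operations such as grouping, reducing, and joining become possible.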

Question 1: pairs with complex values

Answer 1: pairs with complex values

Question 2: mapping single rows to multiple pairs

Answer 2: mapping single rows to multiple pairs

Map-reduce

Map-reduce in Spark

Example: word-count (1)

Example: word-count (2)

Example: word-count (3)

Example: word-count (4)
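
The word-count steps can be sketched in plain Python as three phases: split lines into words, map each word to a `(word, 1)` pair, then sum the counts per word. This is a stand-in that runs without Spark; the function name and sample input are hypothetical. With an RDD the same shape is usually written as `flatMap`, then `map`, then `reduceByKey`.

```python
from collections import defaultdict


def word_count(lines):
    """Count word occurrences using explicit split / pair / reduce phases."""
    # Phase 1 (flatMap): one stream of words from all lines.
    words = (word for line in lines for word in line.split())
    # Phase 2 (map): pair every word with the count 1.
    pairs = ((word, 1) for word in words)
    # Phase 3 (reduce by key): add up the counts belonging to the same word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


print(word_count(["the cat sat", "the cat"]))
```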

ReduceByKey (1)

ReduceByKey (2)
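
A plain-Python sketch of the reduce-by-key idea: values that share a key are merged pairwise with a two-argument function. The helper `reduce_by_key` and the sample data are hypothetical and only imitate `rdd.reduceByKey(func)` on a single machine. In Spark the function should be associative and commutative, because values are combined in no guaranteed order across partitions.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key using the binary function func."""
    merged = {}
    for key, value in pairs:
        # First value for a key is kept as is; later ones are folded in with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical (key, value) pairs.
sales = [("apple", 3), ("pear", 1), ("apple", 4), ("pear", 5)]

totals = reduce_by_key(sales, lambda a, b: a + b)  # sum per key
maxima = reduce_by_key(sales, max)                 # largest value per key

print(totals)
print(maxima)
```

Swapping the function changes the aggregation without touching the grouping logic, which is the main appeal of this operation.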

Other pair RDD operations

Example: pair RDD operations

Example: joining by key
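
A plain-Python sketch of an inner join by key, assuming two lists of `(key, value)` pairs as stand-ins for pair RDDs. The helper name `join_by_key` and the data are hypothetical. The output shape, `(key, (left_value, right_value))`, follows what `left_rdd.join(right_rdd)` returns.

```python
from collections import defaultdict


def join_by_key(left, right):
    """Inner join: emit (key, (left_value, right_value)) for every matching pair."""
    # Index the right side by key so each left record can find its matches.
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_index.get(key, [])
    ]


# Hypothetical data: titles keyed by ID, and ratings keyed by the same ID.
movies = [(1, "Toy Story"), (2, "Heat")]
ratings = [(1, 5), (1, 4), (3, 2)]

joined = join_by_key(movies, ratings)
print(joined)
```

Keys present on only one side (2 and 3 here) are dropped, and a key with several matches produces one output pair per match.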

Using join

Example: join web log with knowledge base article

Example: join web log with knowledge base article

Example output

Other pair operations

Writing and deploying Spark applications

The SparkContext

Python example: word-count

Building a Spark application

Running a Spark application

Running Spark applications locally

Running Spark applications on a cluster

Starting the shell locally or on a cluster
