BigData - PageRank Algorithm with Scala and Spark

PageRank - Spark/Scala Yubraj Pokharel


PageRank Algorithm Implementation in Spark


What is PageRank?

The PageRank of a web page is a number that represents the importance of that page relative to all other web pages.

A web page contains inbound and outbound links.

A page with more inbound links, especially links from pages that are themselves highly ranked, is considered more important.


How to calculate it?

PR(A) = (1 - d) + d * (PR(T1) / C(T1) + ... + PR(Tn) / C(Tn))

PR(A) => PageRank of web page A

d => damping factor (0.85 in the iterations that follow)

PR(Ti) => PageRank of Ti, one of the n pages T1 ... Tn that link to A

C(Ti) => number of outbound links on page Ti
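
The formula above translates almost term by term into Scala. This is a minimal sketch, not the presenter's code: the object name `PageRankFormula`, the function name `pageRank`, and the choice to pass each inbound page as a `(PR(Ti), C(Ti))` pair are assumptions made for illustration.

```scala
object PageRankFormula {
  /** PR(A) = (1 - d) + d * (PR(T1) / C(T1) + ... + PR(Tn) / C(Tn)).
    *
    * `inbound` holds one (PR(Ti), C(Ti)) pair for every page Ti that links to A
    * (hypothetical representation, chosen to mirror the formula).
    */
  def pageRank(d: Double, inbound: Seq[(Double, Int)]): Double =
    (1 - d) + d * inbound.map { case (rank, outLinks) => rank / outLinks }.sum
}
```

For example, `PageRankFormula.pageRank(0.85, Seq((1.0, 3), (1.0, 2)))` evaluates a page with two inbound links, one from a page with three outbound links and one from a page with two, both currently ranked 1.0.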


How is it calculated?

Links: (A, B) (B, C) (B, E) (C, A) (C, D) (C, E) (D, A) (D, C) (E, B)

Each pair (X, Y) is a link from page X to page Y.

Initial Page Ranks

PR(A) = 1.0
PR(B) = 1.0
PR(C) = 1.0
PR(D) = 1.0
PR(E) = 1.0
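
To apply the formula, two things have to be derived from the link list: how many outbound links each page has, and which pages link to each page. A small sketch with plain Scala collections, assuming each pair is read as (source, target); the names `LinkGraph`, `outDegree`, `inbound` and `initialRanks` are illustrative only.

```scala
object LinkGraph {
  // Directed edges (source, target), copied from the Links list.
  val edges: Seq[(String, String)] = Seq(
    "A" -> "B", "B" -> "C", "B" -> "E", "C" -> "A", "C" -> "D",
    "C" -> "E", "D" -> "A", "D" -> "C", "E" -> "B")

  // C(T): number of outbound links of each page.
  val outDegree: Map[String, Int] =
    edges.groupBy(_._1).map { case (page, es) => page -> es.size }

  // The pages T1 ... Tn that link to each page.
  val inbound: Map[String, Set[String]] =
    edges.groupBy(_._2).map { case (page, es) => page -> es.map(_._1).toSet }

  // Every page that appears in the link list starts with rank 1.0.
  val initialRanks: Map[String, Double] =
    (edges.map(_._1) ++ edges.map(_._2)).distinct.map(_ -> 1.0).toMap
}
```

Here `LinkGraph.inbound("A")` is `Set("C", "D")` and `LinkGraph.outDegree("C")` is `3`, which is where the 1/3 and 1/2 terms for PR(A) come from.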


1st iteration

Links: (A, B) (B, C) (B, E) (C, A) (C, D) (C, E) (D, A) (D, C) (E, B)

PR(A) = 0.15 + 0.85 * (1/3 + 1/2) = 0.8583333333333333

PR(B) = 0.15 + 0.85 * (1/1 + 1/1) = 1.85

PR(C) = 0.15 + 0.85 * (1/2 + 1/2) = 1.0

PR(D) = 0.433333333333

PR(E) = 0.858333333333


2nd iteration

Links: (A, B) (B, C) (B, E) (C, A) (C, D) (C, E) (D, A) (D, C) (E, B)

PR(A) = 0.15 + 0.85 * (1 / 3 + 0.433333333333 / 2) = 0.6175

PR(B) = 1.60916666666666

PR(C) = 1.12041666666666

PR(D) = 0.43333333333333

PR(E) = 1.21958333333333
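
The hand calculation for an iteration can be mirrored in code. The sketch below recomputes every page from the ranks of the previous iteration only, which is what the numbers above imply (PR(A) in the 2nd iteration uses the 1st-iteration values of C and D). It is plain Scala with no Spark dependency; `PageRankByHand` and `step` are hypothetical names, and d = 0.85 is taken from the 0.15 + 0.85 * (...) arithmetic.

```scala
object PageRankByHand {
  // Directed edges (source, target) from the Links list.
  val edges: Seq[(String, String)] = Seq(
    "A" -> "B", "B" -> "C", "B" -> "E", "C" -> "A", "C" -> "D",
    "C" -> "E", "D" -> "A", "D" -> "C", "E" -> "B")

  val d = 0.85 // damping factor

  private val outDegree: Map[String, Int] =
    edges.groupBy(_._1).map { case (page, es) => page -> es.size }

  private val inbound: Map[String, Seq[String]] =
    edges.groupBy(_._2).map { case (page, es) => page -> es.map(_._1) }

  /** One iteration: each page sums PR(T) / C(T) over its inbound pages T,
    * always reading the ranks of the previous iteration. */
  def step(ranks: Map[String, Double]): Map[String, Double] =
    ranks.map { case (page, _) =>
      val sum = inbound
        .getOrElse(page, Seq.empty[String])
        .map(t => ranks(t) / outDegree(t))
        .sum
      page -> ((1 - d) + d * sum)
    }
}
```

Calling `step` once on the initial ranks and then again on its own result gives the two iterations; the values should agree with the figures above up to floating-point rounding.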


30th iteration

Links: (A, B) (B, C) (B, E) (C, A) (C, D) (C, E) (D, A) (D, C) (E, B)

(B, 1.685860900896)
(E, 1.1661421381814026)
(C, 1.0575926664315842)
(A, 0.6407530390774026)
(D, 0.4496512554136104)

B is the highest-ranked page.
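
Repeating the update a fixed number of times and sorting by rank produces a listing like the one above. The sketch below is an illustration in plain Scala, not the code from the presentation: each link "pushes" a share of its source's rank to its target, the shares are summed per page, and the damping is applied. `PageRankIterate`, `step` and `run` are hypothetical names.

```scala
object PageRankIterate {
  // Directed edges (source, target) from the Links list.
  val edges: Seq[(String, String)] = Seq(
    "A" -> "B", "B" -> "C", "B" -> "E", "C" -> "A", "C" -> "D",
    "C" -> "E", "D" -> "A", "D" -> "C", "E" -> "B")

  val d = 0.85 // damping factor

  private val outDegree: Map[String, Int] =
    edges.groupBy(_._1).map { case (page, es) => page -> es.size }

  /** One iteration: every link contributes PR(source) / C(source) to its target. */
  def step(ranks: Map[String, Double]): Map[String, Double] = {
    val received: Map[String, Double] = edges
      .map { case (src, dst) => dst -> (ranks(src) / outDegree(src)) }
      .groupBy(_._1)
      .map { case (page, cs) => page -> cs.map(_._2).sum }
    ranks.map { case (page, _) => page -> ((1 - d) + d * received.getOrElse(page, 0.0)) }
  }

  /** Runs `iterations` updates from rank 1.0 and returns pages sorted by descending rank. */
  def run(iterations: Int): Seq[(String, Double)] = {
    val pages = (edges.map(_._1) ++ edges.map(_._2)).distinct
    val initial: Map[String, Double] = pages.map(_ -> 1.0).toMap
    val finalRanks = (1 to iterations).foldLeft(initial)((ranks, _) => step(ranks))
    finalRanks.toSeq.sortBy { case (_, rank) => -rank }
  }
}
```

`PageRankIterate.run(30)` returns the five pages in rank order. Because every page in this graph has at least one outbound link, the ranks keep summing to 5.0 from one iteration to the next, which is a quick sanity check on any implementation.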


Spark/Scala Code
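
A minimal sketch of how the iteration might be expressed with Spark's RDD API, following the general shape of the well-known RDD PageRank example rather than reproducing the code presented on this slide. It assumes Spark is on the classpath and runs in local mode; `SparkPageRankSketch`, the app name, and the hard-coded edge list are illustrative. It also assumes, as is true for this graph, that every page has at least one inbound and one outbound link; pages without inbound links would drop out of `ranks` after the first `reduceByKey`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SparkPageRankSketch {
  /** Iterative PageRank over (source, target) edges with damping factor d. */
  def pageRank(edges: RDD[(String, String)],
               iterations: Int,
               d: Double = 0.85): RDD[(String, Double)] = {
    // page -> all pages it links to; cached because it is reused in every iteration.
    val links = edges.distinct().groupByKey().cache()
    // Every page starts with rank 1.0.
    var ranks: RDD[(String, Double)] = links.mapValues(_ => 1.0)

    for (_ <- 1 to iterations) {
      // Each page sends rank / outDegree to every page it links to.
      val contribs = links.join(ranks).values.flatMap { case (targets, rank) =>
        targets.map(target => (target, rank / targets.size))
      }
      // Sum the contributions per page and apply the damping factor.
      ranks = contribs.reduceByKey(_ + _).mapValues(sum => (1 - d) + d * sum)
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PageRankSketch") // hypothetical app name
      .master("local[*]")        // local mode, for experimentation only
      .getOrCreate()

    val edges = spark.sparkContext.parallelize(Seq(
      "A" -> "B", "B" -> "C", "B" -> "E", "C" -> "A", "C" -> "D",
      "C" -> "E", "D" -> "A", "D" -> "C", "E" -> "B"))

    pageRank(edges, iterations = 30)
      .sortBy(_._2, ascending = false)
      .collect()
      .foreach(println)

    spark.stop()
  }
}
```

The `join` pairs each page's outbound links with its current rank, `flatMap` plays the role of the map phase (emit one contribution per link), and `reduceByKey` plays the role of the reduce phase (sum the contributions per target page).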



Questions??


Thank you :) -happy coding