9 Algorithms: PageRank


Page 1: 9 Algorithms: PageRank

9 Algorithms: PageRank

Page 2: 9 Algorithms: PageRank

Ranking

• After matching, have to rank:

Page 3: 9 Algorithms: PageRank

Index Based Ranking

• Strategies we could (do) use:
  – Frequency
  – Position
  – Metadata

Page 4: 9 Algorithms: PageRank

Missing Ingredient

• Index lacks inter-page information (how pages link to one another)

Page 5: 9 Algorithms: PageRank

Link Quality

• Not all links are equal
• Who do you trust?
  – CS Prof
  – World Famous Chef

Page 6: 9 Algorithms: PageRank

Identifying Authority

• Links into a page give it authority
• Page value = sum of authorities of pages linking to it
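A minimal sketch of the sum rule above. The link graph and the starting authority scores are made up for illustration; neither comes from the slides.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "cs_prof": ["recipe_a"],
    "famous_chef": ["recipe_b"],
    "recipe_a": [],
    "recipe_b": [],
}

# Made-up authority scores for every page (assumption, not from the slides).
authority = {"cs_prof": 1, "famous_chef": 100, "recipe_a": 1, "recipe_b": 1}


def page_value(page, links, authority):
    """Sum the authority of every page that links to `page`."""
    return sum(authority[src] for src, targets in links.items() if page in targets)


values = {page: page_value(page, links, authority) for page in links}
```

Under these assumed scores, the recipe endorsed by the chef ends up with a far higher value than the one endorsed by the CS prof, even though each has exactly one inbound link.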

Page 7: 9 Algorithms: PageRank

Link Quality

• Rewarding more links is easy to abuse

[Figure: Spam Link Pages]

Page 8: 9 Algorithms: PageRank

Issues

• Spam Links
  – Discourage with negative weight

[Figure: Spam Link Pages, each link weighted -1]
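One way the negative-weight idea could look in code. This is only a sketch: it assumes we already have a set of known spam pages (the slides do not say how spam is detected), and it uses a weight of -1 per spam link to match the figure. The function name and the graph are hypothetical.

```python
def weighted_link_score(page, links, spam_pages):
    """Count +1 for each ordinary inbound link and -1 for each link from a spam page."""
    score = 0
    for src, targets in links.items():
        if page in targets:
            score += -1 if src in spam_pages else 1
    return score


# Hypothetical graph: three spam pages and one honest page all link to "target".
links = {
    "spam1": ["target"],
    "spam2": ["target"],
    "spam3": ["target"],
    "honest": ["target"],
    "target": [],
}
spam_pages = {"spam1", "spam2", "spam3"}  # assumed to be known in advance

score = weighted_link_score("target", links, spam_pages)
```

With plain link counting the target would score 4; with the negative weights the spam links drag it below zero.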

Page 9: 9 Algorithms: PageRank

Issues

• Cycles:

Page 10: 9 Algorithms: PageRank

Issues

• Cycles:

Page 11: 9 Algorithms: PageRank

Issues

• Cycles:
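The slides show the cycle problem only as a picture. One reading, sketched here, is that repeatedly applying "value = sum of authorities of linking pages" never settles when pages link in a loop. The two-page graph and the baseline of 1 per page are assumptions for illustration.

```python
# Hypothetical two-page cycle: A links to B and B links back to A.
links = {"A": ["B"], "B": ["A"]}


def update_once(scores, links):
    """One naive round: each page gets 1 plus the scores of the pages linking to it."""
    new_scores = {}
    for page in links:
        inbound = [src for src, targets in links.items() if page in targets]
        new_scores[page] = 1 + sum(scores[src] for src in inbound)
    return new_scores


scores = {"A": 1, "B": 1}
history = [scores["A"]]  # track page A's score after each round
for _ in range(5):
    scores = update_once(scores, links)
    history.append(scores["A"])
```

Each trip around the loop raises both scores again, so there is no final answer to report.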

Page 12: 9 Algorithms: PageRank

Random Surfer

• Simulating a web surfing session
  – Start at random page
  – At each page have a chance to
    • Pick a random link to go to
    • Jump to a completely random page
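The session described above can be sketched in a few lines. The 15% jump chance, the four-page graph, the fixed seed, and the function name are all assumptions; the slides do not give specific values.

```python
import random
from collections import Counter


def random_surfer(links, steps, jump_chance=0.15, seed=0):
    """Simulate one surfing session and return how often each page was visited."""
    rng = random.Random(seed)
    pages = list(links)
    current = rng.choice(pages)  # start at a random page
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        outgoing = links[current]
        if not outgoing or rng.random() < jump_chance:
            current = rng.choice(pages)  # jump to a completely random page
        else:
            current = rng.choice(outgoing)  # pick a random link to go to
    return visits


# Hypothetical graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
steps = 100_000
visits = random_surfer(links, steps)
percentages = {page: 100 * visits[page] / steps for page in links}
```

Pages with no outgoing links are treated as forcing a random jump, which is one common way to keep the surfer from getting stuck.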

Page 13: 9 Algorithms: PageRank

Results

• Results of many random sessions:

Page 14: 9 Algorithms: PageRank

Results

• Expressed as percentages, results stabilize
  – Law of large numbers

Page 15: 9 Algorithms: PageRank

Cycle Buster

• Random surfer not fazed by cycles:

Page 16: 9 Algorithms: PageRank

Random Surfer In Use

• The recipe pages visited by random surfers:

Page 17: 9 Algorithms: PageRank

Simulator

• PageRank Simulator: http://caccio.blogdns.net/software/pagerank-simulator

Page 18: 9 Algorithms: PageRank

The Real Math

• Markov Chains
  – Set of states
  – Each state has probability of leading to other states
  – Represent as matrix
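A rough sketch of the matrix view, in plain Python. Each page is a state, each row holds the probabilities of moving to every other page, and repeatedly applying the matrix to a starting distribution approximates the long-run visit percentages. The three-page graph, the 15% jump chance, and the function names are assumptions, not values from the slides.

```python
def transition_matrix(links, jump_chance=0.15):
    """Build matrix P where P[i][j] is the chance of moving from page i to page j."""
    pages = list(links)
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    matrix = []
    for page in pages:
        outgoing = links[page]
        if outgoing:
            row = [jump_chance / n] * n  # share of the random-jump probability
            for target in outgoing:
                row[index[target]] += (1 - jump_chance) / len(outgoing)
        else:
            row = [1 / n] * n  # dead end: always jump to a random page
        matrix.append(row)
    return pages, matrix


def long_run_distribution(matrix, rounds=100):
    """Start from a uniform distribution and apply the matrix `rounds` times."""
    n = len(matrix)
    dist = [1 / n] * n
    for _ in range(rounds):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


# Hypothetical three-page graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages, matrix = transition_matrix(links)
ranks = dict(zip(pages, long_run_distribution(matrix)))
```

Every row of the matrix sums to 1, so the distribution stays a valid set of probabilities after each round.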

Page 19: 9 Algorithms: PageRank

Excel Simulation

• Three pages:

Page 20: 9 Algorithms: PageRank

Limitations

• Still have issues/room for growth
  – Link Spam
  – Context of link
    • Where link is on page
    • "Bob's recipe is terrible" vs "Bob's recipe is great"
  – Lack of semantic knowledge
    • Page's Authority should not be the same for all domains

Page 21: 9 Algorithms: PageRank

Power

• Controlling search is power: http://www.bitsbook.com/

"If you're not paying for the product, you are the product."