1 How Search Engines Work? Ziv Bar-Yossef Department of Electrical Engineering Technion.

  • date post

    19-Dec-2015

1

How Search Engines Work?

Ziv Bar-Yossef

Department of Electrical Engineering Technion

2

What is the Internet?

A global network of computers connected to each other

Computers “talk” to each other using standard protocols (TCP/IP)

3

What is the World-Wide Web (WWW)?

A collection of pages available via the Internet

Internet users can view pages with web browsers

The WWW is only one application of the Internet

Other applications: email, messengers, VoIP, newsgroups, FTP

4

Web Pages

Various formats: PDF, Word, Excel, images, MP3, video, text

Most popular format: HTML

HTML pages point to each other using hyperlinks

Users “surf the web” by clicking hyperlinks

5

What are Search Engines?

Users have “information needs”

Where can I find solutions to my math homework problem?

Where can I find mp3s of Miri Messika’s latest album?

What is the weather in Eilat during Hanukkah?

What other Sharons are famous, besides our prime minister?

Search engines enable us to find web pages that match our information needs

6

Search Engines

[Diagram: a user with the information need “What other Sharons are famous, except for our prime minister?” submits the query “sharon -ariel” to the search engine. The search engine, which draws on web pages from the Web, returns a ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts]

7

How Search Engines (don’t) Work?

Common misconception: when a user submits a query, the search engine scans all web pages to find the relevant matches

[Diagram: the user submits the query “sharon -ariel”; the search engine scans the web pages of the Web and returns a ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts]

8

How Search Engines Work?

What do you do when you look for a term in an encyclopedia? Use the index!

[Diagram: web pages from the Web are stored in the search engine’s index; the user submits the query “sharon -ariel”, and the search engine consults the index to return a ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts]

9

Search Engine Architecture

[Diagram: the components of a search engine: Crawler, Index, Ranking Algorithm, Query Processor]

10

Web Crawler (a.k.a. Spider)

Fetches web pages and stores them in a local repository

Tries to get as many web pages as possible

Follows hyperlinks to learn about new pages

Refetches pages that change frequently
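The crawling loop described on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: `FAKE_WEB`, `fetch`, and `crawl` are hypothetical names, the "web" is a small in-memory dictionary rather than real HTTP, and refetching of frequently changing pages is left out.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (page text, outgoing hyperlinks).
FAKE_WEB = {
    "a.example": ("home page", ["b.example", "c.example"]),
    "b.example": ("about", ["a.example", "d.example"]),
    "c.example": ("news", ["d.example"]),
    "d.example": ("contact", []),
}

def fetch(url):
    """Stand-in for an HTTP fetch; returns (text, links) or None if unknown."""
    return FAKE_WEB.get(url)

def crawl(seed, max_pages=100):
    """Breadth-first crawl: fetch pages, store them, follow their hyperlinks."""
    repository = {}            # local repository: URL -> page text
    frontier = deque([seed])   # URLs discovered but not yet fetched
    seen = {seed}              # avoids queueing the same URL twice
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        repository[url] = text
        for link in links:     # hyperlinks reveal new pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

repository = crawl("a.example")
print(list(repository))
```

A real crawler would also need politeness rules, error handling, and a refetch schedule; the sketch only shows how following hyperlinks grows the repository.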

11

The Index

The number after each word is its position in the page.

www.cnn.com: Ariel(1) Sharon(2), the(3) prime(4) minister(5) of(6) Israel(7) founded(8) a(9) new(10) political(11) party(12).

www.hollywood.com: Sharon(1) Stone(2) dressed(3) a(4) new(5) Jean(6) Paul(7) Gaultier(8) gown(9) at(10) the(11) Oscars(12) after(13) party(14).

Index

ariel: (cnn.com,1)

dress: (hollywood.com,3)

found: (cnn.com,8)

gaultier: (hollywood.com,8)

gown: (hollywood.com,9)

israel: (cnn.com,7)

jean: (hollywood.com,6)

minister: (cnn.com,5)

new: (cnn.com,10), (hollywood.com,5)

oscar: (hollywood.com,12)

party: (cnn.com,12), (hollywood.com,14)

paul: (hollywood.com,7)

political: (cnn.com,11)

prime: (cnn.com,4)

sharon: (cnn.com,2), (hollywood.com,1)

stone: (hollywood.com,2)
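A minimal sketch of how such an index could be built from the two example pages. The names `PAGES` and `build_index` are assumptions made for illustration. Unlike the slide, the sketch does no stemming (so it stores "dressed" rather than "dress") and does not drop common words such as "the" or "of".

```python
import re
from collections import defaultdict

# The two example pages from the slide, keyed by site.
PAGES = {
    "cnn.com": "Ariel Sharon, the prime minister of Israel founded a new political party.",
    "hollywood.com": "Sharon Stone dressed a new Jean Paul Gaultier gown at the Oscars after party.",
}

def build_index(pages):
    """Map each lowercased word to its posting list of (page, position) pairs."""
    index = defaultdict(list)
    for url, text in pages.items():
        words = re.findall(r"[A-Za-z]+", text)
        # Positions are 1-based, matching the numbering on the slide.
        for position, word in enumerate(words, start=1):
            index[word.lower()].append((url, position))
    return dict(index)

index = build_index(PAGES)
print(index["sharon"])
print(index["party"])
```

Each value in `index` is a posting list: the list of places where that term occurs.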

12

Index by “Anchor Text”

Anchor text: what’s written inside a link

Example: Ariel Sharon, the prime minister…

Usually succinctly describes what’s written in the linked page

Under which terms is a page listed in the index?

Terms that appear in the page

Terms that appear in the anchor text of links to the page
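One possible way to list a page under the anchor text of links pointing to it is sketched below, using Python's standard `html.parser` module. The sample HTML snippet, the `AnchorCollector` class, and the `anchor_index` structure are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.current_href = None   # href of the link being read, if any
        self.anchors = []          # collected (target URL, anchor text) pairs
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            self.anchors.append((self.current_href, "".join(self._text).strip()))
            self.current_href = None

# Hypothetical linking page: the anchor text "Ariel Sharon" points to cnn.com.
html = '<p><a href="http://www.cnn.com">Ariel Sharon</a>, the prime minister, spoke today.</p>'
collector = AnchorCollector()
collector.feed(html)

# List the *linked* page under each term of the anchor text.
anchor_index = defaultdict(set)
for target, text in collector.anchors:
    for term in text.lower().split():
        anchor_index[term].add(target)

print(dict(anchor_index))
```

Note that the terms are credited to the linked page, not to the page that contains the link.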

13

Query Processor

Gets a user query

Fetches relevant posting lists from the index

Extracts relevant matches from the lists

Example: Query = “sharon -ariel”

L1 = posting list of sharon: (cnn.com,2), (hollywood.com,1)

L2 = posting list of ariel: (cnn.com,1)

Return all pages in L1 that do not occur in L2

Result: hollywood.com
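The example query can be traced with a short sketch. It assumes a simple query syntax in which a leading "-" excludes a term; `INDEX`, `pages_for`, and `process_query` are illustrative names, and the index holds only the two posting lists from the example.

```python
# Posting lists from the example: term -> list of (page, position).
INDEX = {
    "sharon": [("cnn.com", 2), ("hollywood.com", 1)],
    "ariel": [("cnn.com", 1)],
}

def pages_for(term, index):
    """Set of pages that appear in the posting list of a term."""
    return {url for url, _position in index.get(term, [])}

def process_query(query, index):
    """Return pages containing every plain term and none of the '-' terms."""
    required, excluded = [], []
    for token in query.lower().split():
        if token.startswith("-"):
            excluded.append(token[1:])
        else:
            required.append(token)
    if not required:
        return []
    # Pages must occur in the posting list of every required term...
    matches = set.intersection(*(pages_for(t, index) for t in required))
    # ...and in none of the posting lists of the excluded terms.
    for term in excluded:
        matches -= pages_for(term, index)
    return sorted(matches)

result = process_query("sharon -ariel", INDEX)
print(result)
```

Because cnn.com appears in the posting list of "ariel", it is removed from the matches for "sharon", leaving only hollywood.com.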

14

Ranking Algorithm

Many queries have many matching pages

472 million matches for “London” in Google

Cannot return all of them to the user

The user needs the most relevant results anyway

Need to order results by relevance

Most relevant results are at the top

Ranking algorithm: a method of ordering matches

The “heart” of a search engine

The reason why Google is the most preferred search engine today

15

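Google’s PageRank: Ranking Elections

Candidates: all web pages

Voters: all web pages

p votes for q if p has a hyperlink to q

Favorites(p) = all the pages p votes for

Fans(p) = all the pages that vote for p

$$
\mathrm{PageRank}(p) =
\begin{cases}
1 & \text{if } p \text{ has no fans} \\
\displaystyle\sum_{q \in \mathrm{Fans}(p)} \frac{\mathrm{PageRank}(q)}{|\mathrm{Favorites}(q)|} & \text{otherwise}
\end{cases}
$$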

16

Google’s PageRank

Underlying principles:

A page is “important” if it has important fans

A page splits its “importance” evenly among its favorite pages

[Figure: example link graph whose pages have PageRank values 1, 1, 1, 1, 1.5, 2.5, and 4]
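The two principles can be tried out on a small example. The sketch below assumes the simplified rule suggested by these slides: a page with no fans gets rank 1, and every other page gets the sum of its fans' ranks, each divided by the number of favorites of that fan. The link graph `FAVORITES` is hypothetical, chosen so that the resulting values are 1, 1, 1, 1, 1.5, 2.5, and 4. This is not Google's production algorithm, and the recursion only terminates on graphs without link cycles.

```python
from functools import lru_cache

# Hypothetical link graph: page -> the pages it links to (its "favorites").
FAVORITES = {
    "a": ["e"],
    "b": ["e", "f"],
    "c": ["f"],
    "d": ["f"],
    "e": ["g"],
    "f": ["g"],
    "g": [],
}

def fans(page):
    """All pages that link to (vote for) the given page."""
    return [q for q, favorites in FAVORITES.items() if page in favorites]

@lru_cache(maxsize=None)
def pagerank(page):
    """Simplified PageRank; assumes the link graph has no cycles."""
    page_fans = fans(page)
    if not page_fans:
        return 1.0
    # Each fan splits its own importance evenly among its favorites.
    return sum(pagerank(q) / len(FAVORITES[q]) for q in page_fans)

ranks = {page: pagerank(page) for page in FAVORITES}
print(ranks)
```

Here page "b" links to two pages, so it passes half of its importance to each, which is where the value 1.5 for "e" comes from (1 from "a" plus 0.5 from "b").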

17

Google’s PageRank

Ranking algorithm:

Find pages that match the given query

Order them by their PageRank

Return top 10 matches
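These three steps might look like the following sketch. The page names and PageRank values in `PAGERANK` are invented for illustration, and `rank_matches` is a hypothetical helper; the matching step is assumed to have already produced the list of matching pages.

```python
# Hypothetical precomputed PageRank values for a few pages.
PAGERANK = {
    "sharon-stone.example": 4.0,
    "sharon-creech.example": 2.5,
    "sharon-ma.example": 1.5,
}

def rank_matches(matches, pagerank, top_k=10):
    """Order matching pages by descending PageRank and keep the top_k."""
    ordered = sorted(matches, key=lambda page: pagerank.get(page, 0.0), reverse=True)
    return ordered[:top_k]

matches = ["sharon-ma.example", "sharon-stone.example", "sharon-creech.example"]
top = rank_matches(matches, PAGERANK)
print(top)
```

Pages missing from the PageRank table are treated as having rank 0 here, which is a design choice of the sketch rather than something stated on the slide.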

18

But… PageRank Does Not Always Work

SPAM

19

Conclusions

Search engines use an index to answer user queries

Ranking is the most important component

Spam is a problem

20

Thank You