ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

61
ITEC547 Text Mining Fall 2015-16 Overview of Search Engines

description

Search Engine Characteristics Unedited – anyone can enter content Quality issues; Spam Varied information types Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place! Different kinds of users – Lexis-Nexis: Paying, professional searchers – Online catalogs: Scholars searching scholarly literature – Web: Every type of person with every type of goal Scale – Hundreds of millions of searches/day; billions of docs

Transcript of ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Page 1: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

ITEC547 Text Mining Fall 2015-16

Overview of

Search Engines

Page 2: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Outline of Presentation

1Early Search Engines

2Indexing Text

for Search

3Indexing

Multimedia

4Queries

5Searching an Index

Page 3: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Search Engine Characteristics

• Unedited – anyone can enter content• Quality issues; Spam

• Varied information types• Phone book, brochures, catalogs, dissertations, news

reports, weather, all in one place!

• Different kinds of users– Lexis-Nexis: Paying, professional searchers– Online catalogs: Scholars searching scholarly literature– Web: Every type of person with every type of goal

• Scale– Hundreds of millions of searches/day; billions of docs

Page 4: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Web Search Queries

• Web search queries are short:– ~2.4 words on average (Aug 2000)– Has increased, was 1.7 (~1997)

• User Expectations:– Many say “The first item shown should be what I

want to see!”– This works if the user has the most

popular/common notion in mind, not otherwise.

Page 5: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Background

History, Problems, Solutions …

1

Page 6: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Search Engines

• Open Text (1995-1997)• Magellan (1995-2001)• Infoseek (Go) (1995-2001)• Snap (NBCi)(1997-2001)• Direct Hit (1998-2002)• Lycos (1994; reborn 1999)• WebCrawler (1994; reborn 2001)• Yahoo (1994; reborn 2002)• Excite (1995; reborn 2001)• HotBot (1996; reborn 2002)• Ask Jeeves (1998; reborn 2002)• Teoma (2000- 2001)• AltaVista (1995- )• LookSmart (1996- )• Overture (1998- )

6

Page 7: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

EARLY SEARCH ENGINES

• Initially used in academic or specialized domains.– Legal and specialized domains consume a large

amount of textual info• Use of expensive proprietary hardware and

software– High computational and storage requirements

• Boolean query model• Iterative search model

– Fetch documents in many steps

7

Page 8: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Medline of National Library of Medicine

• Developed in late 1960 and made available in 1971• Based on inverted file organization• Boolean query language

– Queries broken down and numbered into segments– Results of a queries fed into the next query segment

• Each user assigned a time slot– If cycle not completed in time slot, most recent results are

returned• Query and browse operations performed as separate steps

– Following a query, results are viewed– Modifications start a new query-browse cycle

Page 9: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Dialog

• Broader subject content• Specialized collections of data on payment• Boolean query

– Each term numbered and executed separately then combined

– Word patterns– For multiword queries proximity operator W

Page 10: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Information Retrieval

• The indexing and retrieval of textual documents.

• Searching for pages on the World Wide Web is perhaps most widely used IR application

• Concerned firstly with retrieving relevant documents to a query.

• Concerned secondly with retrieving from large sets of documents efficiently.

Page 11: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Web Search as a Typical IR Task

• Given:– A corpus of textual natural-language documents.– A user query in the form of a textual string.

• Find:– A ranked set of documents that are relevant to

the query.

Page 12: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Typical IR System Architecture

IRSystem

Query String

Documentcorpus

RankedDocuments

1. Doc12. Doc23. Doc3 . .

Page 13: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Architecture of a Search Engine

The Web

Ad indexes

Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)

Miele, Inc -- Anything else is a compromise At the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ... www.miele.com/ - 20k - Cached - Similar pages

Miele Welcome to Miele, the home of the very best appliances and kitchens in the world. www.miele.co.uk/ - 3k - Cached - Similar pages

Miele - Deutscher Hersteller von Einbaugeräten, Hausgeräten ... - [ Translate this page ] Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit ...ein Leben lang. ... Wählen Sie die Miele Vertretung Ihres Landes. www.miele.de/ - 10k - Cached - Similar pages

Herzlich willkommen bei Miele Österreich - [ Translate this page ] Herzlich willkommen bei Miele Österreich Wenn Sie nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE ... www.miele.at/ - 3k - Cached - Similar pages

Sponsored Links

CG Appliance Express Discount Appliances (650) 756-3931 Same Day Certified Installation www.cgappliance.com San Francisco-Oakland-San Jose, CA Miele Vacuum Cleaners Miele Vacuums- Complete Selection Free Shipping! www.vacuums.com Miele Vacuum Cleaners Miele-Free Air shipping! All models. Helpful advice. www.best-vacuum.com

Gatherer/Web Spider

Indexer

Indexes

Search

User

User interface

Searcher and evaluator

Page 14: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

14

How does it work?

• User Interface – Allows you to type a query and displays the results.

• Searcher – The engine searches the database for matching your query.

• Evaluator – The engine assigns scores/ranks to the retrieved information.

• Gatherer – The component that travels the WEB, and collects information.

• Indexer – The engine that categorizes the data collected by the gatherer.

Page 15: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

15

User Interface

• Provides a mechanism for a user to submit queries to the search engine.

• Uses forms, very user friendly.• The user interface displays the search results

in a convenient way.• A summary of each matched page is shown.

Page 16: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

16

Searcher

• It is a program that uses the search engine’s database to locate the matches for a specific query.

• The database of a search engine holds extremely large indexed pages.

• A highly efficient search algorithm is necessary.• Computer Scientists have spent years to develop the

searching and sorting methods.• You can refer to computer books.

Page 17: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

17

Evaluator

• The searcher returns a set of URLs that match your query.

• Not all of the hits equally match your query.• More references to the page, the ranking of the page

will be higher.• How the relevancy score is calculated?

– Varies from one engine to another one.– The number of times of the word appears?– The query words appear in the title?– The query words appear in the META tag?

Page 18: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

18

Web Spider/Gatherer

• It is a program that traverses the Web and gathers information about the Web documents.

• It runs at a short and regular intervals.• It returns information and will be indexed to

the database.• Alternate names: Bot, Crawler, Robot, Spider

and Worm.

Page 19: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Example Crawling Algorithms

Page 20: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

20

Indexer

• It organizes the data by creating a set of keys or an index.

• Indexes need to be rebuilt frequently.• E.g. Libraries – Author, Title, ISBN, etc…• In order to ensure the returned URL is not out of

date.• The search engine is very complex and needs to

break down into different components.

Page 21: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

21

Case Study - AltaVista

• Sending out Crawlers (robot programs) that capture information from the web and bring them back.

• The main crawler – “Scooter” simultaneously send out HTTP requests like blind users on the Web.

• Store all these information to the indexing engine.• Scooter’s cousins help to remove “dead” links.• A typical day, Scooter will visit over 10 million pages.• Web pages with no links referencing will never be found.• You can also submit your URL to AltaVista.

Page 22: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

2 Indexing Text for Search

Reduce retrieval time improve hit accuracy

Page 23: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Why Index

• Simplest approach search text sequentially– Size must be small

• Static, semi-static index• Inverted Index

– mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.

• Documents/Positions in Documents/Weight• Fuzzy/Stemming/Stopwords

Page 24: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

What is an inverted index?

• T1 : "it is what it is“• T2 : "what is it“• T3 : "it is a banana"

• "a": {2} • "banana": {2} • "is": {0, 1, 2} • "it": {0, 1, 2} • "what": {0, 1}

Inverted Index

Example

Page 25: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

• "a": {(2, 2)} • "banana": {(2, 3)} • "is": {(0, 1), (0, 4), (1, 1), (2, 1)} • "it": {(0, 0), (0, 3), (1, 2), (2, 0)} • "what": {(0, 2), (1, 0)}

• T0 : "it is what it is“• T1 : "what is it“• T2 : "it is a banana"

Full Inverted Index

What is an inverted index?

Example

Page 26: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Periodically rebuilt, static otherwise.Documents are parsed to extract tokens. These

are saved with the Document ID.

How Inverted Files Are Created

Now is the timefor all good men

to come to the aidof their country

Doc 1

It was a dark andstormy night in

the country manor. The time was past midnight

Doc 2

Term Doc #now 1is 1the 1time 1for 1all 1good 1men 1to 1come 1to 1the 1aid 1of 1their 1country 1it 2was 2a 2dark 2and 2stormy 2night 2in 2the 2country 2manor 2the 2time 2was 2past 2midnight 2

Page 27: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

• After all documents have been parsed the inverted file is sorted alphabetically.

How Inverted Files are Created

Term Doc #a 2aid 1all 1and 2come 1country 1country 2dark 2for 1good 1in 2is 1it 2manor 2men 1midnight 2night 2now 1of 1past 2stormy 2the 1the 1the 2the 2their 1time 1time 2to 1to 1was 2was 2

Term Doc #now 1is 1the 1time 1for 1all 1good 1men 1to 1come 1to 1the 1aid 1of 1their 1country 1it 2was 2a 2dark 2and 2stormy 2night 2in 2the 2country 2manor 2the 2time 2was 2past 2midnight 2

Page 28: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Multiple term entries for a single document are merged.

Within-document term frequency information is compiled.

How InvertedFiles are Created

Term Doc # Freqa 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2

Term Doc #a 2aid 1all 1and 2come 1country 1country 2dark 2for 1good 1in 2is 1it 2manor 2men 1midnight 2night 2now 1of 1past 2stormy 2the 1the 1the 2the 2their 1time 1time 2to 1to 1was 2was 2

Page 29: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

• Finally, the file can be split into – A Dictionary or Lexicon file

• set of all words in text, vocabulary

and – A Postings file

• each occurrence of the word in the text

How Inverted Files are Created

Page 30: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Dictionary/Lexicon Postings

How Inverted Files are CreatedTerm Doc # Freqa 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2

Doc # Freq2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2

Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2

Page 31: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

• Permit fast search for individual terms• For each term, you get a list consisting of:

– document ID – frequency of term in doc (optional) – position of term in doc (optional)

• These lists can be used to solve Boolean queries:• country -> d1, d2• manor -> d2• country AND manor -> d2

• Also used for statistical ranking algorithms

Inverted indexes

Page 32: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

• Inverted indexes are still used, even though the web is so huge.

• Some systems partition the indexes across different machines. Each machine handles different parts of the data.

• Other systems duplicate the data across many machines; queries are distributed among the machines.

• Most do a combination of these.

Inverted Indexes for Web Search Engines

Page 33: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Inverted Index : should we index all words?

Page 34: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

How to index words?

Page 35: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Inverted Index : web features

Page 36: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Google Index

• A unique DocId associated with each URL• Hit: word occurences

– wordID: 24 bit number– Word position– Font size relative to the rest of the document– Plain hit : in the document– Fancy hit : in the URL, title, anchor text, meta tags

• Word occurrences of a web page are distributed across a set of barrels

Page 37: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Architecture of the 1st Google Engine

Page 38: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Architecture of the 1st Google Engine

Page 39: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.
Page 40: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

3 Indexing Multimedia

Broadcast and compress for seamless delivery

Page 41: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Indexing Multimedia

• Forming an index for multimedia– Use context : surrounding

text– Add manual description– Analyze automatically and

attach a description

Page 42: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

4 Queries

Page 43: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Queries

• Keywords• Proximity• Patterns• Phrases• Ranges• Weights of keywords• Spelling mistakes

Page 44: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Queries

• Boolean query– No relevance measure– May be hard to understand

• Multimedia query– Find images of Everest– Find x-rays showing the human rib cage– Find companies whose stock prices have similar

patterns

Page 45: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Relevance

• Relevance is a subjective judgment and may include:– Being on the proper subject.– Being timely (recent information).– Being authoritative (from a trusted source).– Satisfying the goals of the user and his/her

intended use of the information (information need).

46

Page 46: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Keyword Search

• Simplest notion of relevance is that the query string appears verbatim in the document.

• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

47

Page 47: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Problems with Keywords

• May not retrieve relevant documents that include synonymous terms.– “restaurant” vs. “café”– “PRC” vs. “China”

• May retrieve irrelevant documents that include ambiguous terms.– “bat” (baseball vs. mammal)– “Apple” (company vs. fruit)– “bit” (unit of data vs. act of eating)

48

Page 48: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Relevance Feedback

• User enters query terms– Keywords maybe weighted or not

• Links returned– Choose the relevant and irrelevant ones

• If there is no negative feedback second term is 0• T’s are terms from relevant and irrelevant sets

marked by the user

Page 49: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

SEARCHING AN INDEX 5Searching an Index

Page 50: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Searching an Inverted Index

• Tokenize the query, search index vocabulary for each query token

• Get a list of documents associated with each token

• Combine the list of documents using constraints specified in the query

Page 51: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Google Search

1. Tokenize query and remove stopwords2. Translate the query words into wordIDs using the lexicon3. For every wordID get the list of documents from the short

inverted barrel and build a composite set of documents4. Scan the composite list of documents

i. Skip to next document if the current document does not matchii. Compute a rank using query and featuresiii. If no more documents go to step 3 and use full inverted barrels

to find more docsiv. If there are sufficient # of docs go to step 5

5. Sort the final Document List by rank

Page 52: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

How are results ranked?

• Weight type• Location: title, URL, anchor, body• Size: relative font size• Capitalization• Count occurrences • Closeness (proximity)

Page 53: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Evaluation

• Response time • quality• Recall : % of correct items that are selected

• Precision : % of selected items that are correct

Page 54: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Ranking Algorithms : Hyperlink

• Popularity Ranking• Rank “popular” documents higher among set of

documents with specific keywords.• Determining “Popularity”

– Access rate ?• How to get accurate data?

– Bookmarks?• Might be private?

– Links to related pages?• Using web crawler to analyze external links.

Page 55: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Popularity/Prestige

• transfer of prestige– a link from a popular page x to a page y is treated

as conferring more prestige to page y than a link from a not-so-popular page z.

• Count of In-links/Out-links

Page 56: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Hypertext Induced Topic Search (HITS)

• The HITS algorithm:– compute popularity using set of related pages

only.• Important web pages : cited by other

important web pages or a large number of less-important pages

• Initially all pages have same importance

Page 57: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Hubs and Authorities

• Hub - A page that stores links to many related pages– may not in itself contain actual information on a topic

• Authority - A page that contains actual information on a topic – may not store links to many related pages

• Each page gets a prestige value as a hub (hub-prestige), and another prestige value as an authority (authority-prestige).

Page 58: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Hubs and Authorities in twitter

Page 59: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Hubs and Authorities algorithm

1. Locate and build the subgraph2. Assign initial values to hub and authority scores of each node3. Run a loop till convergence

i. Assign the sum of the hub scores of all nodes y that link to node x to the authority score of x

ii. Assign the sum of the authority scores of all nodes y that are linked from node x to node y to hub score of node x

iii. Normalize the hub and authority scores of all nodesiv. Check for convergence. Is the difference< threshold?

4. Return the list of nodes sorted in descending order of hub and authority scores

Page 60: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Page Rank Algorithm

• Ranks based on citation statistics– In/out links

• Rank of a page depends on the ranks of the pages that link to it.

Page 61: ITEC547 Text Mining Fall 2015-16 Overview of Search Engines.

Page rank Algorithm

1. Locate and build subgraph2. Save the number of out-links from every node in an

array3. Assign a default PageRank to all nodes4. Run a loop till convergence

i. Compute a new PageRank score for every node. Assign the sum of PageRank scores divided by the number of out-links of every node that links to a node and add the default rank source

ii. Check convergence. Is the difference between new and old PageRank< threshold?