
ITEC547 Text Mining Fall 2015-16

Overview of Search Engines

Outline of Presentation

1 Early Search Engines
2 Indexing Text for Search
3 Indexing Multimedia
4 Queries
5 Searching an Index

Search Engine Characteristics

• Unedited – anyone can enter content
• Quality issues; spam
• Varied information types
  – Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!
• Different kinds of users
  – Lexis-Nexis: paying, professional searchers
  – Online catalogs: scholars searching scholarly literature
  – Web: every type of person, with every type of goal
• Scale
  – Hundreds of millions of searches per day; billions of documents

Web Search Queries

• Web search queries are short:
  – ~2.4 words on average (Aug 2000)
  – This has increased; it was ~1.7 words around 1997
• User expectations:
  – Many say "The first item shown should be what I want to see!"
  – This works if the user has the most popular/common notion in mind, but not otherwise.

Background

History, Problems, Solutions …


Search Engines

• Open Text (1995-1997)
• Magellan (1995-2001)
• Infoseek (Go) (1995-2001)
• Snap (NBCi) (1997-2001)
• Direct Hit (1998-2002)
• Lycos (1994; reborn 1999)
• WebCrawler (1994; reborn 2001)
• Yahoo (1994; reborn 2002)
• Excite (1995; reborn 2001)
• HotBot (1996; reborn 2002)
• Ask Jeeves (1998; reborn 2002)
• Teoma (2000-2001)
• AltaVista (1995- )
• LookSmart (1996- )
• Overture (1998- )


EARLY SEARCH ENGINES

• Initially used in academic or specialized domains
  – Legal and other specialized domains consume a large amount of textual information
• Use of expensive proprietary hardware and software
  – High computational and storage requirements
• Boolean query model
• Iterative search model
  – Fetch documents in many steps


Medline of National Library of Medicine

• Developed in the late 1960s and made available in 1971
• Based on an inverted file organization
• Boolean query language
  – Queries are broken down into numbered segments
  – The results of one query segment are fed into the next
• Each user is assigned a time slot
  – If the cycle is not completed within the time slot, the most recent results are returned
• Query and browse operations are performed as separate steps
  – Following a query, the results are viewed
  – Modifications start a new query-browse cycle

Dialog

• Broader subject content
• Specialized collections of data available for payment
• Boolean queries
  – Each term is numbered and executed separately, then the results are combined
  – Word patterns
  – For multiword queries, the proximity operator W

Information Retrieval

• The indexing and retrieval of textual documents.

• Searching for pages on the World Wide Web is perhaps the most widely used IR application.
• Concerned firstly with retrieving documents relevant to a query.
• Concerned secondly with retrieving from large sets of documents efficiently.

Web Search as a Typical IR Task

• Given:
  – A corpus of textual natural-language documents.
  – A user query in the form of a textual string.
• Find:
  – A ranked set of documents that are relevant to the query.

Typical IR System Architecture

Figure: the IR system takes a query string and a document corpus as input and returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).

Architecture of a Search Engine

Figure: architecture of a search engine. A gatherer/web spider crawls the Web and feeds an indexer, which builds the indexes (including ad indexes). A searcher and evaluator answer queries submitted through the user interface, which displays a ranked results page (the slide shows a Google results page for the query "miele", with organic results and sponsored links).

How does it work?

• User Interface – lets you type a query and displays the results.
• Searcher – searches the database for matches to your query.
• Evaluator – assigns scores/ranks to the retrieved information.
• Gatherer – travels the Web and collects information.
• Indexer – categorizes the data collected by the gatherer.


User Interface

• Provides a mechanism for a user to submit queries to the search engine.
• Uses forms and is designed to be user friendly.
• Displays the search results in a convenient way.
• A summary of each matched page is shown.


Searcher

• A program that uses the search engine's database to locate the matches for a specific query.
• The database of a search engine holds an extremely large number of indexed pages.
• A highly efficient search algorithm is therefore necessary.
• Computer scientists have spent years developing these searching and sorting methods; see any standard algorithms textbook.


Evaluator

• The searcher returns a set of URLs that match your query.
• Not all of the hits match your query equally well.
• The more references there are to a page, the higher its ranking will be.
• How is the relevancy score calculated? It varies from one engine to another:
  – How many times does the word appear?
  – Do the query words appear in the title?
  – Do the query words appear in the META tag?


Web Spider/Gatherer

• A program that traverses the Web and gathers information about Web documents.
• It runs at short, regular intervals.
• The information it returns is then indexed into the database.
• Alternate names: bot, crawler, robot, spider and worm.

Example Crawling Algorithms

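The slide gives only the topic heading; as a concrete illustration, here is a minimal breadth-first crawler sketch in Python. The seed URLs, the regular-expression link extraction and the fixed politeness delay are simplifying assumptions for illustration, not the algorithm of any particular engine.

import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=100, delay=1.0):
    """Breadth-first crawl starting from seed_urls; returns {url: html}."""
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    visited = set(seed_urls)             # URLs already queued or fetched
    pages = {}
    link_pattern = re.compile(r'href="([^"#]+)"')   # crude link extractor
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                     # skip dead or unreachable links
        pages[url] = html
        for link in link_pattern.findall(html):
            absolute = urljoin(url, link)
            if absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)                # politeness: pause between requests
    return pages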

Indexer

• Organizes the data by creating a set of keys, i.e. an index (e.g. libraries index by author, title, ISBN, etc.).
• Indexes need to be rebuilt frequently, in order to ensure the returned URLs are not out of date.
• A search engine is very complex and needs to be broken down into different components.


Case Study - AltaVista

• Sends out crawlers (robot programs) that capture information from the Web and bring it back.
• The main crawler, "Scooter", sends out many simultaneous HTTP requests, behaving like blind users browsing the Web.
• All of this information is passed to the indexing engine.
• Scooter's cousins help to remove "dead" links.
• On a typical day, Scooter visits over 10 million pages.
• Web pages with no links referencing them will never be found.
• You can also submit your URL to AltaVista directly.

2 Indexing Text for Search

Reduce retrieval time; improve hit accuracy

Why Index

• The simplest approach is to search the text sequentially
  – Only feasible if the collection is small
• Static or semi-static indexes
• Inverted Index
  – A mapping from content, such as words or numbers, to its locations in a database file, a document, or a set of documents.
  – Entries can record documents, positions within documents, and weights
  – Fuzzy matching, stemming, stopwords

What is an inverted index?

• T0: "it is what it is"
• T1: "what is it"
• T2: "it is a banana"

A (record-level) inverted index maps each term to the set of documents that contain it:

• "a": {2}
• "banana": {2}
• "is": {0, 1, 2}
• "it": {0, 1, 2}
• "what": {0, 1}

Full Inverted Index

A full inverted index also records where each term occurs, as (document, position) pairs:

• "a": {(2, 2)}
• "banana": {(2, 3)}
• "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
• "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
• "what": {(0, 2), (1, 0)}
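A minimal sketch of how such a full inverted index can be built in Python (assuming simple lowercase/whitespace tokenization; an illustration, not the indexing code of any particular engine):

from collections import defaultdict

def build_full_inverted_index(docs):
    """docs: list of strings. Returns {term: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return index

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_full_inverted_index(docs)
print(index["is"])   # [(0, 1), (0, 4), (1, 1), (2, 1)]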

Inverted indexes are periodically rebuilt and static otherwise. Documents are parsed to extract tokens, which are saved together with the document ID.

How Inverted Files Are Created

Doc 1: "Now is the time for all good men to come to the aid of their country"

Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"

Term/Doc pairs in document order:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2

• After all documents have been parsed, the inverted file is sorted alphabetically.

How Inverted Files are Created

Term/Doc pairs sorted alphabetically:
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

Multiple term entries for a single document are merged.

Within-document term frequency information is compiled.

How Inverted Files are Created

Term/Doc/Freq entries after merging:
a 2 1, aid 1 1, all 1 1, and 2 1, come 1 1, country 1 1, country 2 1, dark 2 1, for 1 1, good 1 1, in 2 1, is 1 1, it 2 1, manor 2 1, men 1 1, midnight 2 1, night 2 1, now 1 1, of 1 1, past 2 1, stormy 2 1, the 1 2, the 2 2, their 1 1, time 1 1, time 2 1, to 1 2, was 2 2

• Finally, the file can be split into:
  – A Dictionary or Lexicon file: the set of all words in the text (the vocabulary), and
  – A Postings file: one entry for each occurrence of each word in the text.

How Inverted Files are Created

Dictionary/Lexicon (term, number of docs, total frequency):
a 1 1, aid 1 1, all 1 1, and 1 1, come 1 1, country 2 2, dark 1 1, for 1 1, good 1 1, in 1 1, is 1 1, it 1 1, manor 1 1, men 1 1, midnight 1 1, night 1 1, now 1 1, of 1 1, past 1 1, stormy 1 1, the 2 4, their 1 1, time 2 2, to 1 2, was 1 2

Postings (doc #, frequency), listed in dictionary order:
(2,1), (1,1), (1,1), (2,1), (1,1), (1,1), (2,1), (2,1), (1,1), (1,1), (2,1), (1,1), (2,1), (2,1), (1,1), (2,1), (2,1), (1,1), (1,1), (2,1), (2,1), (1,2), (2,2), (1,1), (1,1), (2,1), (1,2), (2,2)

• Inverted files permit fast search for individual terms.
• For each term, you get a list consisting of:
  – document ID
  – frequency of the term in the doc (optional)
  – positions of the term in the doc (optional)
• These lists can be used to solve Boolean queries (a code sketch follows):
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2
• They are also used for statistical ranking algorithms.
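A minimal sketch of answering the Boolean AND query above by intersecting posting lists, using the country/manor example (the set representation and function name are illustrative choices):

# Posting lists as sets of document IDs (from the Doc 1 / Doc 2 example).
postings = {
    "country": {1, 2},
    "manor":   {2},
}

def boolean_and(index, *terms):
    """Intersect the posting lists of all query terms."""
    result = None
    for term in terms:
        docs = index.get(term, set())
        result = docs if result is None else result & docs
    return result or set()

print(boolean_and(postings, "country", "manor"))   # {2}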

Inverted indexes

• Inverted indexes are still used, even though the web is so huge.

• Some systems partition the indexes across different machines. Each machine handles different parts of the data.

• Other systems duplicate the data across many machines; queries are distributed among the machines.

• Most do a combination of these.

Inverted Indexes for Web Search Engines

Inverted Index: should we index all words?

How to index words?

Inverted Index: web features

Google Index

• A unique DocId is associated with each URL.
• Hit: a word occurrence, recording
  – wordID: a 24-bit number
  – word position
  – font size relative to the rest of the document
  – plain hit: occurs in the body of the document
  – fancy hit: occurs in the URL, title, anchor text or meta tags
• The word occurrences of a web page are distributed across a set of barrels (see the sketch below).
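As an illustration only (not Google's actual bit-level encoding), a hit record with the fields listed above and a toy barrel assignment might look like this; everything beyond the listed fields is an assumption:

from dataclasses import dataclass

@dataclass
class Hit:
    word_id: int        # 24-bit word identifier from the lexicon
    position: int       # position of the word within the page
    font_size: int      # font size relative to the rest of the document
    fancy: bool         # True if in URL, title, anchor text or meta tags

# Hits grouped by (word, doc) and spread across "barrels"
# (here just a list of dicts, purely for illustration).
barrels = [dict() for _ in range(4)]

def add_hit(doc_id, hit):
    barrel = barrels[hit.word_id % len(barrels)]   # pick a barrel by wordID
    barrel.setdefault((hit.word_id, doc_id), []).append(hit)

add_hit(42, Hit(word_id=0x00A1B2, position=7, font_size=2, fancy=False))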

Architecture of the 1st Google Engine


3 Indexing Multimedia

Broadcast and compress for seamless delivery

Indexing Multimedia

• Forming an index for multimedia:
  – Use context: the surrounding text
  – Add a manual description
  – Analyze the content automatically and attach a description

4 Queries

Queries

• Keywords
• Proximity
• Patterns
• Phrases
• Ranges
• Weights of keywords
• Spelling mistakes

Queries

• Boolean query
  – No relevance measure
  – May be hard to understand
• Multimedia query
  – Find images of Everest
  – Find x-rays showing the human rib cage
  – Find companies whose stock prices have similar patterns

Relevance

• Relevance is a subjective judgment and may include:
  – Being on the proper subject.
  – Being timely (recent information).
  – Being authoritative (from a trusted source).
  – Satisfying the goals of the user and his/her intended use of the information (the information need).


Keyword Search

• Simplest notion of relevance is that the query string appears verbatim in the document.

• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).


Problems with Keywords

• May not retrieve relevant documents that include synonymous terms:
  – "restaurant" vs. "café"
  – "PRC" vs. "China"
• May retrieve irrelevant documents that include ambiguous terms:
  – "bat" (baseball vs. mammal)
  – "Apple" (company vs. fruit)
  – "bit" (unit of data vs. act of eating)


Relevance Feedback

• The user enters query terms
  – Keywords may be weighted or not
• Links are returned
  – The user marks the relevant and irrelevant ones
• The query is then updated from this feedback (see the formula sketch below); if there is no negative feedback, the second term is 0
• The T's are terms from the relevant and irrelevant sets marked by the user
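The slide refers to a query-update formula without showing it. A standard formulation of this kind of feedback is the Rocchio update, given here as an illustrative assumption rather than the exact formula from the slide:

q_m = \alpha \, q_0 + \frac{\beta}{|D_r|} \sum_{d_j \in D_r} d_j - \frac{\gamma}{|D_{nr}|} \sum_{d_k \in D_{nr}} d_k

where q_0 is the original query vector, D_r and D_{nr} are the sets of documents the user marked relevant and irrelevant, the d's are their term vectors (the T's on the slide), and \alpha, \beta, \gamma are tuning weights. When the user marks nothing as irrelevant, the negative-feedback term is 0, matching the remark above.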

5 Searching an Index

Searching an Inverted Index

• Tokenize the query and look up each query token in the index vocabulary
• Get the list of documents associated with each token
• Combine the lists of documents using the constraints specified in the query

Google Search

1. Tokenize the query and remove stopwords
2. Translate the query words into wordIDs using the lexicon
3. For every wordID, get the list of documents from the short inverted barrel and build a composite set of documents
4. Scan the composite list of documents:
   i. Skip to the next document if the current document does not match
   ii. Compute a rank using the query and the document features
   iii. If there are no more documents, go to step 3 and use the full inverted barrels to find more docs
   iv. If there is a sufficient number of docs, go to step 5
5. Sort the final document list by rank

How are results ranked?

• Weight type:
  – Location: title, URL, anchor, body
  – Size: relative font size
  – Capitalization
• Count of occurrences
• Closeness (proximity)
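A minimal sketch of combining some of these factors into a score (the particular weights, the hit tuple layout and the proximity bonus are assumptions for illustration; real engines tune such factors carefully):

# Assumed location weights: hits in prominent places count more.
LOCATION_WEIGHT = {"title": 4.0, "url": 3.0, "anchor": 2.0, "body": 1.0}

def score(hits, query_terms):
    """hits: list of (term, location, font_size, position) tuples for one page."""
    total = 0.0
    positions = []
    for term, location, font_size, position in hits:
        if term not in query_terms:
            continue
        total += LOCATION_WEIGHT.get(location, 1.0) * font_size  # per-occurrence weight
        positions.append(position)
    if len(positions) > 1:
        spread = max(positions) - min(positions)
        total += 5.0 / (1 + spread)      # proximity bonus: closer query terms score higher
    return total

print(score([("miele", "title", 2, 0), ("vacuum", "body", 1, 3)], {"miele", "vacuum"}))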

Evaluation

• Response time
• Quality:
  – Recall: % of correct items that are selected
  – Precision: % of selected items that are correct
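A tiny worked example of these two measures (the document sets are made up for illustration):

relevant  = {"d1", "d2", "d3", "d4"}   # all correct items in the collection
retrieved = {"d2", "d3", "d7"}         # items selected by the engine

true_positives = relevant & retrieved
recall    = len(true_positives) / len(relevant)    # 2/4 = 0.5
precision = len(true_positives) / len(retrieved)   # 2/3 ≈ 0.67

print(recall, precision)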

Ranking Algorithms: Hyperlink

• Popularity ranking
  – Rank "popular" documents higher among the set of documents matching specific keywords.
• Determining "popularity":
  – Access rate? How do we get accurate data?
  – Bookmarks? They might be private.
  – Links to related pages? Use a web crawler to analyze external links.

Popularity/Prestige

• Transfer of prestige
  – A link from a popular page x to a page y is treated as conferring more prestige on page y than a link from a not-so-popular page z.
• Count of in-links/out-links

Hypertext Induced Topic Search (HITS)

• The HITS algorithm computes popularity using a set of related pages only.
• Important web pages are those cited by other important web pages, or by a large number of less-important pages.
• Initially, all pages have the same importance.

Hubs and Authorities

• Hub – a page that stores links to many related pages
  – may not in itself contain actual information on a topic
• Authority – a page that contains actual information on a topic
  – may not store links to many related pages
• Each page gets a prestige value as a hub (hub prestige), and another prestige value as an authority (authority prestige).

Hubs and Authorities in Twitter

Hubs and Authorities algorithm

1. Locate and build the subgraph
2. Assign initial values to the hub and authority scores of each node
3. Run a loop until convergence:
   i. Set the authority score of node x to the sum of the hub scores of all nodes y that link to x
   ii. Set the hub score of node x to the sum of the authority scores of all nodes y that x links to
   iii. Normalize the hub and authority scores of all nodes
   iv. Check for convergence: is the difference < threshold?
4. Return the list of nodes sorted in descending order of hub and authority scores
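A minimal sketch of this loop in Python (the adjacency-list graph representation, the toy graph, the threshold and the Euclidean normalization are illustrative assumptions):

import math

def hits(graph, threshold=1e-6, max_iter=100):
    """graph: {node: [nodes it links to]}. Returns (hub, authority) score dicts."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Authority of x = sum of hub scores of nodes linking to x
        new_auth = {n: 0.0 for n in nodes}
        for x, targets in graph.items():
            for y in targets:
                new_auth[y] += hub[x]
        # Hub of x = sum of authority scores of nodes x links to
        new_hub = {x: sum(new_auth[y] for y in graph.get(x, [])) for x in nodes}
        # Normalize both score vectors
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
        diff = sum(abs(new_hub[n] - hub[n]) + abs(new_auth[n] - auth[n]) for n in nodes)
        hub, auth = new_hub, new_auth
        if diff < threshold:
            break
    return hub, auth

# Toy graph: page "a" links to "b" and "c"; "b" links to "c".
print(hits({"a": ["b", "c"], "b": ["c"]}))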

PageRank Algorithm

• Ranks pages based on citation statistics (in-links and out-links).
• The rank of a page depends on the ranks of the pages that link to it.

PageRank Algorithm

1. Locate and build the subgraph
2. Save the number of out-links from every node in an array
3. Assign a default PageRank to all nodes
4. Run a loop until convergence:
   i. Compute a new PageRank score for every node: sum, over every node that links to it, that node's PageRank score divided by its number of out-links, and add the default rank source
   ii. Check convergence: is the difference between the new and old PageRank < threshold?
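A minimal sketch of this procedure, treating the "default rank source" as the standard (1 - d)/N term with damping factor d = 0.85 (the damping value, toy graph and threshold are illustrative assumptions; dangling nodes are ignored for simplicity):

def pagerank(graph, damping=0.85, threshold=1e-6, max_iter=100):
    """graph: {node: [nodes it links to]}. Returns {node: PageRank score}."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    out_degree = {n: len(graph.get(n, [])) for n in nodes}   # step 2
    rank = {n: 1.0 / len(nodes) for n in nodes}              # step 3: default rank
    for _ in range(max_iter):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}  # rank source
        for x, targets in graph.items():
            if out_degree[x] == 0:
                continue
            share = damping * rank[x] / out_degree[x]
            for y in targets:
                new_rank[y] += share        # step 4.i: shares from in-linking nodes
        diff = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if diff < threshold:                # step 4.ii: convergence check
            break
    return rank

# Toy graph: "b" and "c" both link to "a"; "a" links to "b".
print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))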