Introduction into Search Engines and Information Retrieval
![Page 1: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/1.jpg)
Search Engines
Google & Co. vs Internet
An Introduction to Information Retrieval
![Page 2: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/2.jpg)
Contents
Overview
History
Introduction to Information Retrieval
Page Rank by Example
Google & Co.
![Page 3: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/3.jpg)
Search Engines Overview
deep impact (not only on search)
a big challenge for developers
search engines are getting larger
the problems are not new
![Page 4: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/4.jpg)
History
The web happened (1992)
Mosaic/Netscape happened (1993-95)
Crawler happened (1994): M. Mauldin
SEs happened 1994-1996
– InfoSeek, Lycos, Altavista, Excite, Inktomi, …
Yahoo decided to go with a directory
Google happened 1996-98
– Tried selling technology to other engines
– SEs thought search was a commodity, portals were in
Microsoft said: whatever …
![Page 5: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/5.jpg)
Present
Most search engines have vanished
Google is a big player
Yahoo decided to de-emphasize directories
– Buys three search engines
Microsoft realized Internet is here to stay
– Dominates the browser market
– Realizes search is critical
![Page 6: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/6.jpg)
Share Of Searches: July 2005
![Page 7: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/7.jpg)
Google
first launched Sep. 1999
over 4 billion pages by beginning of 2004
strengths
– size and scope
– relevance-based ranking
– cached archive
weaknesses
– limited search features
– only indexes first 101KB of sites and PDFs
![Page 8: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/8.jpg)
Yahoo!
David Filo, Jerry Yang => 1995
originally just a subject directory
strengths
– large, new (Feb. 2004) database
– cached copies
– support of full Boolean searching
weaknesses
– lack of some advanced search features
– indexes only the first 500KB
– tricky wildcard
![Page 9: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/9.jpg)
MSN Search
used to use third-party databases
Feb. 2005: began using its own database
strengths
– large, unique database
– cached copies, including date cached
weaknesses
– limited advanced features
– no title search, truncation, stemming
![Page 10: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/10.jpg)
How Search Engines Work
Crawler-Based Search Engines
– listing created automatically
Human-Powered Directories
– contents filled by hand
"Hybrid Search Engines" Or Mixed Results
– best of both worlds
![Page 11: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/11.jpg)
Ranking Of Sites
location and frequency of keywords
– keywords near top of page
– spamming filter
"off the page" ranking
– link structure
– filtering fake links
– clickthrough measurement
![Page 12: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/12.jpg)
Search Engine Placement Tips (1)
pick your target keywords
position your keywords
have relevant content
avoid search engine stumbling blocks
– have HTML links
– frames can kill
– dynamic doorblocks
![Page 13: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/13.jpg)
Search Engine Placement Tips (2)
build links
just say no to search engine spamming
submit your key pages
verify & maintain your listing
beyond search engines
![Page 14: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/14.jpg)
Features for webmasters
| Crawling | Yes | No | Notes |
| --- | --- | --- | --- |
| Deep Crawl | AllTheWeb, Google, Inktomi | AltaVista, Teoma | |
| Frames Support | All | n/a | |
| Robots.txt | All | n/a | |
| Meta Robots Tag | All | n/a | |
| Paid Inclusion | All but… | Google | |
| Full Body Text | All | n/a | Some stop words may not be indexed |
| Stop Words | AltaVista, Inktomi, Google | FAST | Teoma unknown |
| Meta Description | All provide some support, but AltaVista, AllTheWeb and Teoma make most use of the tag | | |
| Meta Keywords | Inktomi, Teoma | AllTheWeb, AltaVista, Google | Teoma support is "unofficial" |
| ALT text | AltaVista, Google, Teoma | AllTheWeb, Inktomi | |
| Comments | Inktomi | Others | |
![Page 15: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/15.jpg)
What is Information Retrieval?
Information gets lost in the mass of documents, but has to be found again
Definition: IR is the field that deals with retrieving information/knowledge from large document databases.
![Page 16: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/16.jpg)
Quality of an IR-System (1)
Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved.
Precision = 1: all retrieved documents are relevant
Precision = (relevant documents retrieved) / (all documents retrieved), with values in [0, 1]
![Page 17: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/17.jpg)
Quality of an IR-System (2)
Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents (retrieved or not).
Recall = 1: all relevant documents were found
Recall = (relevant documents retrieved) / (all relevant documents), with values in [0, 1]
![Page 18: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/18.jpg)
Quality of an IR-System (3)
Aim of a good IR system: increase both Precision and Recall!
Problem:
– increasing Precision causes a decrease in Recall
  e.g. the search returns only 1 (relevant) document: Recall -> 0, Precision = 1
– increasing Recall causes a decrease in Precision
  e.g. the search returns all available documents: Recall = 1, Precision -> 0
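The trade-off can be checked with a few lines of Python. This is only a sketch: the collection of 100 documents with 10 relevant ones is made up for illustration, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Assumed toy collection: 100 documents, of which documents 0..9 are relevant.
all_docs = set(range(100))
relevant = set(range(10))

narrow = precision_recall({0}, relevant)      # one relevant hit: high precision, low recall
broad = precision_recall(all_docs, relevant)  # everything returned: high recall, low precision
print(narrow, broad)
```

Returning a single relevant document gives precision 1 but recall 0.1; returning the whole collection gives recall 1 but precision 0.1.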
![Page 19: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/19.jpg)
Mathematical models
Boolean Model
Vector Space Model
![Page 20: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/20.jpg)
Boolean model
checks whether the document includes the search term (true) or not (false)
true means the document is relevant
Problem:
– high variation in the result size, depending on the search term
– no ranking of the result set -> no sorting possible
– "relevance" criterion is too strict (e.g. AND, OR)
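A minimal sketch of Boolean retrieval over an inverted index may make these problems concrete. The three documents and the helper names `boolean_and` / `boolean_or` are assumptions chosen for illustration.

```python
# Assumed toy documents.
docs = {
    1: "google is a search engine",
    2: "yahoo started as a directory",
    3: "a search engine crawls the web",
}

# Inverted index: term -> set of ids of the documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents that contain every term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    """Documents that contain at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))

and_result = boolean_and("search", "engine")
or_result = boolean_or("google", "yahoo")
strict = boolean_and("google", "directory")  # empty: no document has both terms
print(and_result, or_result, strict)
```

The results are plain sets: a document either matches or it does not, so there is nothing to sort by, and a single missing term under AND empties the result.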
![Page 21: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/21.jpg)
Vector space model (1)
index: weighted vector
$d_j = (w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{n,j})$
search: weighted vector
$q = (w_{1,q}, w_{2,q}, w_{3,q}, \ldots, w_{n,q})$
analyze the angle between search vector and document vector by using the cosine function
the smaller the angle, the more relevant the document -> use it for ranking
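The angle comparison can be sketched in Python as follows. The weight vectors are invented for illustration, and `cosine_similarity` is a hypothetical helper; a cosine close to 1 corresponds to a small angle.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_d * norm_q)

# Assumed weights over a vocabulary of three index terms.
query = [1.0, 1.0, 0.0]
documents = {
    "d1": [2.0, 2.0, 0.0],  # same direction as the query
    "d2": [1.0, 0.0, 1.0],  # shares one term with the query
    "d3": [0.0, 0.0, 3.0],  # shares no term with the query
}

ranking = sorted(
    documents,
    key=lambda name: cosine_similarity(documents[name], query),
    reverse=True,
)
print(ranking)
```

Because the cosine depends only on direction, `d1` ranks first even though its weights are twice as large as the query's.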
![Page 22: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/22.jpg)
Vector space model (2)
"relevance" criterion is more tolerant
no use of Boolean operators
uses weighting
creates a ranking -> sorting is possible
Problem: automatic weighting of index terms in queries and documents
![Page 23: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/23.jpg)
Weighting Methods (1)
Zipf's law
global weighting (IDF, "inverse document frequency")
– considers the distribution of words in a language
– filters out words like "or", "and" (words that occur very often) and weights them weakly
$IDF = \log(N/n)$
N = number of documents in the system
n = number of documents including the index term
![Page 24: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/24.jpg)
Weighting Methods (2)
local weighting
– considers the term frequency within documents
– weighting corresponds to the frequency
– takes the different lengths of documents into account and normalizes the term frequency
$ntf_{i,j} = \frac{tf_{i,j}}{\max_{l=1,\ldots,n} tf_{l,j}}$
$tf_{i,j}$ = absolute frequency of term $t_i$ in document $d_j$
![Page 25: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/25.jpg)
Weighting Methods (3)
tf-idf weighting
– combination of global (inverse document frequency) and local (normalized term frequency) weighting
$w_{i,j} = ntf_{i,j} \cdot idf_i$
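The three weighting steps can be combined in a short Python sketch. The toy documents are invented, the natural logarithm is an assumption (the slides do not state the base), and the function names `idf`, `ntf` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Assumed toy collection, already split into terms.
docs = {
    "d1": "the search engine search".split(),
    "d2": "the web directory".split(),
    "d3": "the web crawler".split(),
}
N = len(docs)  # number of documents in the system

# n: number of documents that contain each term.
doc_freq = Counter(term for words in docs.values() for term in set(words))

def idf(term):
    """Global weight log(N/n); only defined for terms that occur in the collection."""
    return math.log(N / doc_freq[term])

def ntf(term, words):
    """Local weight: term frequency divided by the highest frequency in the document."""
    counts = Counter(words)
    return counts[term] / max(counts.values())

def tf_idf(term, doc_id):
    words = docs[doc_id]
    if term not in words:
        return 0.0
    return ntf(term, words) * idf(term)

for term in ("the", "search", "engine"):
    print(term, tf_idf(term, "d1"))
```

The word "the" occurs in every document, so its IDF is log(1) = 0 and it drops out of the ranking, while "search" (frequent in `d1`, absent elsewhere) gets the highest weight.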
![Page 26: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/26.jpg)
Web-Mining
Web-Mining ≈ Data-Mining, different problems
Mining of: Content, Structure or User
– Content-Mining: vector space model (VSM), Boolean model (BM)
– Structure-Mining: analysis of structure
– User-Mining: information about the users of a page
Let's have a deeper look at Web-Structure-Mining!
![Page 27: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/27.jpg)
History
IR necessary but not sufficient for web search
– Doesn't address web navigation
  Query ibm seeks www.ibm.com
  To IR, www.ibm.com may look less topical than a quarterly report
Link analysis
– Hubs and authorities (Jon Kleinberg)
– PageRank (Brin and Page)
  Computed on the entire graph
  Query independent
  Faster if serving lots of queries
– Others…
![Page 28: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/28.jpg)
Analysis of Hyperlinks
Links
– Long history in citation analysis
– Navigational tools on the web
– Also a sign of popularity
– Can be thought of as recommendations (source recommends destination)
– Also describe the destination: anchor text
Idea: the existence of a hyperlink between two pages also carries information
Hyperlinks can be used to:
– Create a weighting of web pages
– Find pages with similar topics
– Group pages by different contexts of meaning
![Page 29: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/29.jpg)
Hubs and Authorities
Describe the quality of a website
Authorities: pages that are linked to very often
Hubs: pages that link to other pages very often
Example:
– Authority: Heise.de
– Hub: Peter's Linklist
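As a rough illustration of these two definitions, the sketch below simply counts incoming and outgoing links. The link graph and page names are invented, and plain link counting is a simplification: Kleinberg's method refines hub and authority scores iteratively, which is not shown here.

```python
from collections import Counter

# Assumed link graph: page -> pages it links to.
links = {
    "linklist": ["news", "shop", "blog"],
    "blog": ["news"],
    "shop": ["news"],
    "news": [],
}

# Hub candidate: links to many other pages (high out-degree).
out_degree = {page: len(targets) for page, targets in links.items()}

# Authority candidate: is linked to by many pages (high in-degree).
in_degree = Counter(target for targets in links.values() for target in targets)

best_hub = max(out_degree, key=out_degree.get)
best_authority = max(in_degree, key=in_degree.get)
print(best_hub, best_authority)
```

Here the link list comes out as the hub and the news page as the authority, mirroring the Peter's Linklist / Heise.de example.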
![Page 30: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/30.jpg)
Page Rank
Invented by Lawrence Page and Sergey Brin
Algorithm itself is well described
Implementations are not (Google)
Main idea:
– relationship of all links in the WWW
– The more a document is linked to, the more important it is
– Not every link counts the same: a link from an important page is worth more
![Page 31: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/31.jpg)
Page Rank Algorithm
$PR(p_0) = (1 - q) + q \sum_{i} \frac{PR(p_i)}{|outlink(p_i)|}$
PR(p0): Page Rank of a page
PR(pi): Page Rank of the pages linking to p0
outlink(pi): all outgoing links of pi
q = damping factor of the random walk (normally q = 0.85)
Attention: recursive function!
![Page 32: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/32.jpg)
Page Rank Example
with q = 0.5
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
![Page 33: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/33.jpg)
Page Rank Calculation
Solving the system of equations directly is not feasible for a graph the size of the web
Iterative calculation of Page Rank necessary
Each page starts with 1
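The iteration can be sketched in Python as follows. It assumes the update rule PR(p) = (1 - q) + q · Σ PR(pi) / |outlink(pi)|, which matches the equations of the three-page example; the link graph A→B, A→C, B→C, C→A is inferred from those equations, and the fixed iteration count of 50 is an arbitrary choice.

```python
def pagerank(links, q=0.85, iterations=50):
    """Iterate the Page Rank equations, starting every page at 1."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum the shares passed on by every page that links to this one.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - q) + q * incoming
        pr = new_pr
    return pr

# Link graph inferred from the example equations (an assumption).
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(example, q=0.5)
print(ranks)
```

With q = 0.5 the values settle at roughly 1.0769, 0.7692 and 1.1538, i.e. the 14/13, 10/13 and 15/13 of the example.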
![Page 34: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/34.jpg)
Page Rank Incoming Links
Given:
PR(A) = PR(B) = PR(C) = PR(D) = 1
PR(X) = 10
PR(A) = 0.5 + 0.5 (PR(X) + PR(D)) = 5.5 + 0.5 PR(D)
PR(B) = 0.5 + 0.5 PR(A)
PR(C) = 0.5 + 0.5 PR(B)
PR(D) = 0.5 + 0.5 PR(C)
PR(A) = 19/3 = 6.33
PR(B) = 11/3 = 3.67
PR(C) = 7/3 = 2.33
PR(D) = 5/3 = 1.67
![Page 35: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/35.jpg)
Page Rank Outgoing Links
PR(A) = 0.25 + 0.75 PR(B)
PR(B) = 0.25 + 0.375 PR(A)
PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A)
PR(D) = 0.25 + 0.75 PR(C)
PR(A) = 14/23
PR(B) = 11/23
PR(C) = 35/23
PR(D) = 32/23
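For a graph this small the four equations can also be solved exactly, which reproduces the fractions above. The sketch assumes q = 3/4 and the link graph A→B, A→C, B→A, C→D, D→C, both read off the coefficients of the equations; `solve` is a hypothetical helper that performs Gauss-Jordan elimination on exact fractions.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Solve matrix * x = rhs exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    rows = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [value / pivot_value for value in rows[col]]
        # Eliminate the column from all other rows.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

q = Fraction(3, 4)  # inferred from the coefficients 0.25 and 0.75
links = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["C"]}  # inferred graph
pages = sorted(links)

# Row for page p: PR(p) - q * sum(PR(s) / |outlink(s)|) = 1 - q
matrix = [
    [Fraction(int(p == s)) - (q / len(links[s]) if p in links[s] else 0) for s in pages]
    for p in pages
]
rhs = [1 - q] * len(pages)
ranks = dict(zip(pages, solve(matrix, rhs)))
print(ranks)
```

The exact approach only works because there are four unknowns; for a web-scale graph the iterative calculation remains the practical route.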
![Page 36: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/36.jpg)
Page Rank other Examples
Dangling Links
Different hierarchies
![Page 37: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/37.jpg)
Page Rank Implementation
Normally implemented as a weighting system
Additional content search needed for retrieving the document set
Also involved in Page Rank:
– The markup of a link
– The position of a link in the document
– The distance between the pages (e.g. other domain)
– The context of the linking page
– How up to date the page is
![Page 38: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/38.jpg)
Google Past
1995 research project at Stanford University
![Page 39: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/39.jpg)
Google Past
One of the earliest storage systems
![Page 40: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/40.jpg)
Google – How it began
Peak of google.stanford.edu
![Page 41: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/41.jpg)
Servers 1999
![Page 42: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/42.jpg)
![Page 43: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/43.jpg)
Google by Numbers
Index: 40 TB (4 billion pages with an estimated size of 10 KB each)
Up to 2000 servers in one cluster
Over 30 clusters
One petabyte of data per cluster: so much that a hard-disk error rate of 1 in 10^15 bits becomes a real problem
In each larger cluster, two servers normally break down every day
System running stable (without any breakdowns) since February 2000 (Yes, they don't use Windows servers…)
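The numbers on this slide can be checked with a little arithmetic. The sketch assumes decimal units (1 KB = 1000 bytes) and reads the error figure as one faulty bit per 10^15 bits; both are interpretations, not statements from the slides.

```python
# Index size: 4 billion pages at an estimated 10 KB each (decimal units assumed).
pages = 4_000_000_000
avg_page_bytes = 10 * 1000
index_tb = pages * avg_page_bytes / 1000**4  # bytes -> terabytes

# One petabyte per cluster, expressed in bits.
cluster_bits = 1000**5 * 8

# Assumption: the quoted error rate means 1 faulty bit per 10^15 bits.
assumed_bits_per_error = 10**15
expected_bit_errors = cluster_bits / assumed_bits_per_error

print(index_tb, expected_bit_errors)
```

Under these assumptions the index comes to 40 TB, and one full pass over a cluster's data would be expected to hit about 8 faulty bits, which is why an error rate that sounds negligible becomes a practical problem at this scale.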
![Page 44: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/44.jpg)
Outlook: Semantic Web
Information should be readable by humans and machines
Unified description of data & knowledge
First approaches: metadata, e.g. Dublin Core
Current: RDF
![Page 45: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/45.jpg)
Outlook: Personalized Search Engine
A new approach: personalized search engines
Advantage: you only get what you are personally interested in
Disadvantage: a lot of data has to be collected
Example: www.fooxx.com
![Page 46: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/46.jpg)
Links
www.searchenginewatch.com (general information about search engines)
http://pr.efactory.de (page rank algorithm)
http://zdnet.de/itmanager/unternehmen/0,39023441,39129811-2,00.htm (article: “Google’s Technologien: Von Zauberei kaum zu unterscheiden”)
![Page 47: Introduction into Search Engines and Information Retrieval](https://reader030.fdocuments.net/reader030/viewer/2022012818/54c91d5d4a7959a72d8b4584/html5/thumbnails/47.jpg)
The End
Thank you for your attention