T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Web IR
Date posted: 20-Dec-2015
Transcript of T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Web IR
- Slide 1
- T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Web IR
- Slide 2
- Web IR
  - What's Different about Web IR?
  - Web IR Queries
  - How to Compare Web Search Engines?
  - The HITS Scoring Method
- Slide 3
- What's different about the Web?
  - Bulk: 500M; growth at 20M/month
  - Lack of stability: estimates range from 1%/day to 1%/week
  - Heterogeneity
    - Types of documents: text, pictures, audio, scripts...
    - Quality
    - Document languages: 100+
  - Duplication
  - Non-running text
  - High linkage: ≥ 8 links/page on average
- Slide 4
- Taxonomy of Web Document Languages
  - Metalanguages: SGML, HyTime, XML
  - Languages: SMIL, MathML, RDF, XHTML, HTML, TEI Lite
  - Style sheets: DSSSL, XSL, CSS
- Slide 5
- Non-running Text
- Slide 6
- What's different about the Web Users?
  - Make poor queries
    - short (2.35 terms on average)
    - imprecise terms
    - sub-optimal syntax (80% without operators)
    - low effort
  - Wide variance in
    - Needs
    - Expectations
    - Knowledge
    - Bandwidth
  - Specific behavior
    - 85% look over one result screen only
    - 78% of queries are not modified
- Slide 7
- Why don't the Users get what they Want?
  - User need → User request (verbalized) → Query to IR system → Results
  - Translation problems: polysemy, synonymy
  - Example
    - User need: "I need to get rid of mice in the basement"
    - User request: "What's the best way to trap mice alive?"
    - Query: "Mouse trap"
    - Results: computer supplies, software, etc.
- Slide 8
- Alta Vista: Mouse trap
- Slide 9
- Alta Vista: Mice trap
- Slide 10
- Challenges on the Web
  - Distributed data
  - Dynamic data
  - Large volume
  - Unstructured and redundant data
  - Data quality
  - Heterogeneous data
- Slide 11
- Web IR Advantages
  - High linkage
  - Interactivity
  - Statistics
    - easy to gather
    - large sample sizes
- Slide 12
- Evaluation in the Web Context
  - Quality of pages varies widely
  - Relevance is not enough
  - We need both relevance and high quality = value of page
- Slide 13
- Example of Web IR Query Results
- Slide 14
- How to Compare Web Search Engines?
  - Search engines hold huge repositories!
  - Search engines hold different resources!
  - Solution: precision at top 10, the % of the top 10 pages that are relevant (ranking quality)
  - [Figure: Resources, Retrieved (Ret), and Relevant Returned (RR)]
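The precision-at-top-10 measure on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `precision_at_k`, the list of result URLs, and the set of pages judged relevant are hypothetical stand-ins, not anything taken from the slides.

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Return the fraction of the top-k ranked results that are relevant.

    ranked_results: result identifiers in the order the engine returned them.
    relevant: set of identifiers judged relevant (assumed to come from human
              assessment; the slides do not say how relevance is decided).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_results[:k]
    relevant_in_top_k = sum(1 for item in top_k if item in relevant)
    # Divide by k (not by len(top_k)), so an engine that returns fewer
    # than k results is not rewarded for the missing ones.
    return relevant_in_top_k / k


if __name__ == "__main__":
    # Hypothetical result list and relevance judgments for one query.
    results = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10", "p11"]
    judged_relevant = {"p1", "p3", "p5", "p11"}
    print(precision_at_k(results, judged_relevant))
```

Because the measure looks only at the first ten results, it can be computed for engines with different and very large repositories without knowing the full set of relevant pages.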
- Slide 15
- The HITS Scoring Method
  - New method from 1998
    - improved quality
    - reduced number of retrieved documents
  - Based on the Web's high linkage
  - Simplified implementation in Google (www.google.com)
  - Advanced implementation in Clever
  - Reminder: hypertext is a nonlinear graph structure
- Slide 16
- HITS Definitions
  - Authorities (A): good sources of content
  - Hubs (H): good sources of links
- Slide 17
- HITS Intuition
  - Authority comes from in-edges.
  - Being a hub comes from out-edges.
  - Better authority comes from in-edges from hubs.
  - Being a better hub comes from out-edges to authorities.
- Slide 18
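- HITS Algorithm
  - [Figure: node $v$ with in-neighbors $w_1, w_2, \ldots, w_k$ (contributing to its authority score A) and out-neighbors $u_1, u_2, \ldots, u_k$ (contributing to its hub score H)]
  - Repeat until HUB and AUTH converge:
    - Normalize HUB and AUTH
    - $\mathrm{HUB}[v] := \sum_i \mathrm{AUTH}[u_i]$ for all $u_i$ with $\mathrm{Edge}(v, u_i)$
    - $\mathrm{AUTH}[v] := \sum_i \mathrm{HUB}[w_i]$ for all $w_i$ with $\mathrm{Edge}(w_i, v)$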
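The update rules on this slide can be tried out with a small Python sketch. Several details are assumptions rather than part of the slides: the function name `hits`, the edge-list input format, Euclidean (L2) normalization, the convergence tolerance, and the choice to compute hub scores from the freshly updated authority scores within each round.

```python
import math


def hits(edges, max_iter=100, tol=1e-9):
    """Iterate the HUB/AUTH updates on a directed graph given as (src, dst) pairs.

    Returns two dicts, (hub, auth), mapping each node to its score.
    """
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}, {}

    out_links = {n: [] for n in nodes}  # v -> all u with Edge(v, u)
    in_links = {n: [] for n in nodes}   # v -> all w with Edge(w, v)
    for src, dst in sorted(set(edges)):  # drop duplicate edges
        out_links[src].append(dst)
        in_links[dst].append(src)

    def normalize(scores):
        # L2 normalization (an assumption; the slide only says "Normalize").
        norm = math.sqrt(sum(s * s for s in scores.values()))
        if norm == 0:
            return dict(scores)
        return {n: s / norm for n, s in scores.items()}

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # AUTH[v] := sum of HUB[w] over all w with Edge(w, v)
        new_auth = normalize({v: sum(hub[w] for w in in_links[v]) for v in nodes})
        # HUB[v] := sum of AUTH[u] over all u with Edge(v, u)
        new_hub = normalize({v: sum(new_auth[u] for u in out_links[v]) for v in nodes})
        change = max(
            abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


if __name__ == "__main__":
    # Hypothetical toy graph: three hub-like pages linking to two content pages.
    toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
    hub_scores, auth_scores = hits(toy_edges)
    print(sorted(auth_scores, key=auth_scores.get, reverse=True)[:2])
    print(sorted(hub_scores, key=hub_scores.get, reverse=True)[:2])
```

On the toy graph, the page with the most in-edges from hubs gets the highest authority score, and pages that link to both content pages score higher as hubs than the page that links to only one.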
- Slide 19
- Google Output: Princess Diana
- Slide 20
- Prototype Implementation (Clever)
  - [Figure: the Root set contained in the larger Base set]
  - 1. Selecting documents using the index (root)
  - 2. Adding linked documents
  - 3. Iterating to find hubs and authorities
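Steps 1 and 2 above can be sketched as a small set-expansion routine. Everything named here is hypothetical: the function `build_base_set`, the dictionary representation of the link graph, and the `max_in` cap on in-links per root page. The cap is an assumption added to keep popular pages from flooding the set; the slides do not mention one.

```python
def build_base_set(root, out_links, in_links, max_in=50):
    """Expand a root set of pages into a base set by following links.

    root: pages selected from the index for the query (step 1).
    out_links: dict mapping a page to the pages it links to.
    in_links: dict mapping a page to the pages that link to it.
    max_in: assumed cap on how many in-linking pages are added per root page.
    """
    base = set(root)
    for page in root:
        # Step 2a: add every page the root page points to.
        base.update(out_links.get(page, ()))
        # Step 2b: add up to max_in pages that point to the root page
        # (sorted only to make the selection deterministic).
        base.update(sorted(in_links.get(page, ()))[:max_in])
    return base


if __name__ == "__main__":
    # Hypothetical link data for a single root page.
    out_links = {"r1": ["a", "b"]}
    in_links = {"r1": ["x", "y", "z"]}
    print(sorted(build_base_set({"r1"}, out_links, in_links, max_in=2)))
```

Step 3 would then run the hub and authority iteration over the links among the pages in the returned base set.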
- Slide 21
- By-products
  - Separates Web sites into clusters.
  - Reveals the underlying structure of the World Wide Web.