T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Web IR

  • date post

    20-Dec-2015

Transcript of T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Web IR

  • Slide 1
  • T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Web IR
  • Slide 2
  • Web IR
      • What's Different about Web IR?
      • Web IR Queries
      • How to Compare Web Search Engines?
      • The HITS Scoring Method
  • Slide 3
  • What's different about the Web?
      • Bulk ... (500M); growth at 20M/month
      • Lack of stability ... estimates: 1%/day to 1%/week
      • Heterogeneity: types of documents (text, pictures, audio, scripts...), quality, document languages (100+)
      • Duplication
      • Non-running text
      • High linkage ... ≥ 8 links/page average
  • Slide 4
  • Taxonomy of Web Document Languages
      • Metalanguages: SGML, HyTime, XML
      • Languages: SMIL, MathML, RDF, XHTML, HTML, TEI Lite
      • Style sheets: DSSSL, XSL, CSS
  • Slide 5
  • Non-running Text
  • Slide 6
  • What's different about the Web Users?
      • Make poor queries: short (2.35 terms average); imprecise terms; sub-optimal syntax (80% without operators); low effort
      • Wide variance in needs, expectations, knowledge, and bandwidth
      • Specific behavior: 85% look over one result screen only; 78% of queries are not modified
  • Slide 7
  • Why don't the Users get what they Want?
      • User need → user request (verbalized) → query to IR system → results
      • Translation problems: polysemy, synonymy
      • Example: "I need to get rid of mice in the basement" → "What's the best way to trap mice alive?" → "Mouse trap" → computer supplies, software, etc.
  • Slide 8
  • Alta Vista: Mouse trap
  • Slide 9
  • Alta Vista: Mice trap
  • Slide 10
  • Challenges on the Web
      • Distributed data
      • Dynamic data
      • Large volume
      • Unstructured and redundant data
      • Data quality
      • Heterogeneous data
  • Slide 11
  • Web IR Advantages
      • High linkage
      • Interactivity
      • Statistics: easy to gather; large sample sizes
  • Slide 12
  • Evaluation in the Web Context
      • Quality of pages varies widely
      • Relevance is not enough
      • We need both relevance and high quality = value of page
  • Slide 13
  • Example of Web IR Query Results
  • Slide 14
  • How to Compare Web Search Engines?
      • Search engines hold huge repositories!
      • Search engines hold different resources!
      • Solution: precision at top 10, i.e. the % of the top 10 pages that are relevant (ranking quality)
      • [Diagram: Resources; Retrieved (Ret); RR = Relevant Returned]
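The precision-at-top-10 measure on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `precision_at_k`, the page identifiers, and the relevance judgments are all made up for the example, and the relevant set is assumed to come from human judges.

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k returned pages that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_results[:k]
    # Count the returned pages in the top k that are also relevant (RR).
    relevant_returned = sum(1 for page in top_k if page in relevant)
    # Dividing by k (not by len(top_k)) penalizes engines returning fewer than k pages.
    return relevant_returned / k


# Hypothetical ranking from one engine and hypothetical relevance judgments.
engine_a = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
relevant = {"p1", "p3", "p4", "p8", "p20"}

score_a = precision_at_k(engine_a, relevant)
print(score_a)
```

Because only the top 10 results are inspected, two engines can be compared this way without knowing the size or contents of either repository.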
  • Slide 15
  • The HITS Scoring Method
      • New method from 1998: improved quality; reduced number of retrieved documents
      • Based on the Web's high linkage
      • Simplified implementation in Google (www.google.com)
      • Advanced implementation in Clever
      • Reminder: hypertext is a nonlinear graph structure
  • Slide 16
  • HITS Definitions
      • Authorities (A): good sources of content
      • Hubs (H): good sources of links
  • Slide 17
  • HITS Intuition
      • Authority comes from in-edges.
      • Being a hub comes from out-edges.
      • Better authority comes from in-edges from hubs.
      • Being a better hub comes from out-edges to authorities.
  • Slide 18
  • HITS Algorithm
      • [Diagram: node v with in-edges from w1, w2, ..., wk and out-edges to u1, u2, ..., uk]
      • Repeat until HUB and AUTH converge:
          • Normalize HUB and AUTH
          • HUB[v] := Σ AUTH[u_i] for all u_i with Edge(v, u_i)
          • AUTH[v] := Σ HUB[w_i] for all w_i with Edge(w_i, v)
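The update rules on this slide can be turned into a small runnable sketch. Several details below are assumptions rather than something the slide specifies: scores start at 1.0, normalization uses the Euclidean norm, authorities are updated before hubs within each round, and convergence is tested with a tolerance `tol`. The toy graph and the function name `hits` are hypothetical.

```python
import math


def hits(edges, max_iterations=100, tol=1e-9):
    """Iterate the HUB/AUTH updates on a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}, {}
    out_links = {n: [] for n in nodes}
    in_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    def normalize(scores):
        # Scale so the squared scores sum to 1 (assumed normalization).
        norm = math.sqrt(sum(s * s for s in scores.values()))
        return dict(scores) if norm == 0 else {n: s / norm for n, s in scores.items()}

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iterations):
        # AUTH[v] := sum of HUB[w] over all w with Edge(w, v)
        new_auth = normalize({v: sum(hub[w] for w in in_links[v]) for v in nodes})
        # HUB[v] := sum of AUTH[u] over all u with Edge(v, u)
        new_hub = normalize({v: sum(new_auth[u] for u in out_links[v]) for v in nodes})
        change = max(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes)
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Hypothetical toy graph: three hub-like pages pointing at two authority-like pages.
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub_scores, auth_scores = hits(toy_edges)
print(max(auth_scores, key=auth_scores.get))  # page with the highest authority score
```

In this toy graph `a1` is pointed to by all three hubs, so it ends up with the highest authority score, and `h1` and `h2` outrank `h3` as hubs because they point to both authorities.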
  • Slide 19
  • Google Output: Princess Diana
  • Slide 20
  • Prototype Implementation (Clever)
      • [Diagram: Root set inside Base set]
      • 1. Selecting documents using index (root)
      • 2. Adding linked documents (base)
      • 3. Iterating to find hubs and authorities
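Steps 1 and 2 above (root set, then base set) can be sketched as follows. This is not Clever's actual code: the link tables, the function names `expand_root_set` and `induced_edges`, and the `max_in_per_page` cap on in-links are all assumptions made for illustration. The root set is assumed to have already been selected from a text index.

```python
def expand_root_set(root, out_links, in_links, max_in_per_page=50):
    """Grow the root set into a base set by adding pages linked to or from it."""
    base = set(root)
    for page in root:
        # Pages that the root page points to.
        base.update(out_links.get(page, ()))
        # Pages that point to the root page, capped (hypothetical limit)
        # so that one very popular page cannot flood the base set.
        base.update(list(in_links.get(page, ()))[:max_in_per_page])
    return base


def induced_edges(base, out_links):
    """Keep only the links whose two endpoints are both in the base set."""
    return [(src, dst)
            for src in sorted(base)
            for dst in out_links.get(src, ())
            if dst in base]


# Hypothetical link tables: page -> pages it links to / pages linking to it.
out_links = {"r1": ["x"], "y": ["r1"], "z": ["q"]}
in_links = {"r1": ["y"], "x": ["r1"], "q": ["z"]}

base_set = expand_root_set({"r1"}, out_links, in_links)
base_edges = induced_edges(base_set, out_links)
print(sorted(base_set), base_edges)
```

The resulting edge list is the graph on which step 3, the hub and authority iteration, would then run.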
  • Slide 21
  • By-products
      • Separates Web sites into clusters.
      • Reveals the underlying structure of the World Wide Web.