Search and Discovery: Searching the Web. Stages of a transaction Discovery –Find what you’re...

Search and Discovery: Searching the Web
  • date post

    20-Dec-2015

Search and Discovery: Searching the Web

Stages of a transaction

• Discovery
  – Find what you’re interested in
    • Locate sellers
    • Locate buyers
    • Compare products
• Negotiation
• Exchange

Discovery

• Encompasses:
  – Search engines
  – Recommender systems
  – Price comparison/shopping agents
  – Description languages
  – Data sources
    • Generic sources: portals, web directories
    • Domain-specific sources: catalogs, guides, etc.
  – Advertising

Discovery

• More than just finding a resource
  – Need to be able to estimate value, likelihood of successful negotiation
  – An evaluative infrastructure is required
• Least formalized of e-commerce subareas.
• Unlikely to have a general-purpose solution soon
  – Too complex

A Brief History of the Web

• Prehistory:
  – Hypertext as an idea has been around since the 40s.
    • Vannevar Bush: Memex
    • Engelbart: 60s
  – 1987: HyperCard
    • Graphical tool allowing users to create hyperlinked documents.
  – Late 80s/early 90s: WAIS, Gopher

A Brief History of the Web

• 1989/90: Tim Berners-Lee proposes the WWW at CERN
  – A new global information retrieval system
  – Develops HTML, a simple markup language
• 1993: Mosaic developed at NCSA
  – Marc Andreessen then founds Netscape
• 1993/94: NCSA httpd released
  – Open-source web server, supported CGI
  – Precursor to Apache

A Brief History of the Web

• 1994: Banner ads appear on HotWired
  – Beginning of the commercial web
• 1994: Yahoo founded
  – Appearance of the portal, search engine
• 1995: NSF backbone privatized
  – AT&T, Sprint, etc. take over traffic
  – Network Solutions given a monopoly on domain names
• 1995: Microsoft releases Internet Explorer
  – In 7 years, Netscape goes from 100% market share to 20% (2001).

A Brief History of the Web

• 1995: AltaVista started
  – Full-text Web search
• 1995: Andreessen first WWW billionaire
• 1995: Sun introduces Java
  – Able to ship code and text across networks
• 1995: eBay founded
  – First online auction
• 1995-98: Explosive growth
  – Many new formats, applications, companies
• 1998: Akamai founded (web caching)

A Brief History of the Web

• 1998: ICANN governs names & addresses
• 1998: MP3 format popularized
  – WinAmp released
  – Small enough to make audio distribution practical
• 1998: Google founded.
• 2000: Napster appears
  – Beginnings of peer-to-peer technology, file sharing
• 2000(ish): End of the boom
  – Consolidation, reduction in growth

Lessons from Radio

• Radio was popularized in the 1920s
  – Originally intended as a one-to-one messaging system.
  – Fee-for-use pay structure.
• 1922: Explosive growth begins
  – RCA’s revenues from sales of receivers doubled each year
  – Broadcast model becomes prevalent
  – Thousands of broadcasters emerge

Lessons From Radio

• 1922-1924: Transition
  – How to make money broadcasting?
    • Support sale of receivers
    • Goodwill (sponsors)
    • Public good – supported as a non-profit
    • Advertising
    • Tube tax/set tax (a la BBC)
  – By 1924, stations are failing as quickly as they start.

Lessons From Radio

• Affordable content driven by audience size
• “Rich-get-richer” for large stations
• 1926: RCA launches NBC
  – First nationwide broadcast
  – Creates the network system
    • National content, local broadcasting
  – Advertising the dominant revenue generator
• WWW questions:
  – Who will be NBC?
  – What will the revenue model be?
    • Advertising? Competition with TV, radio for this revenue.
    • Micropayments? Subscriptions? Content aggregation?

Searching the Web

• Web growth estimated at 1000% in late 90s.
• Can search engines keep up with this growth?
• How to deal with the dynamic nature of the web?
  – Page contents change
  – Pages appear, disappear, move
  – Link structure changes

Search Engines

• Most common form of discovery
• Crawl the web to collect pages
• Stored and indexed for easy retrieval
• Query languages simple
• Goals:
  – Fast retrieval (Google gets 150 million queries per day)
  – Accurate (no dead links)
  – Precise (pages match user’s needs)

Terminology

• Outward link
  – Object that a page links to
• Outdegree: number of outward links
• Inward link
  – Pages that link to an object
• Indegree: number of inward links
• Path
  – Series of outward links from A to B
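
The terms above can be made concrete with a small sketch. The link map `LINKS` and the function names are hypothetical, chosen only for illustration; a real crawl would supply the link data.

```python
from collections import deque

# Hypothetical link map: page -> pages it links to (its outward links).
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def outdegree(links, page):
    """Number of outward links of a page."""
    return len(links[page])

def indegree(links, page):
    """Number of pages that link to the given page."""
    return sum(page in targets for targets in links.values())

def has_path(links, src, dst):
    """True if dst can be reached from src by following outward links."""
    seen, queue = {src}, deque([src])
    while queue:
        page = queue.popleft()
        if page == dst:
            return True
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Here D can reach A (D links to C, C links to A), but nothing links to D, so there is no path from A to D.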

The Web as a Directed Graph

• We can represent the web as a directed graph.
  – Sites are nodes
  – Links are edges.
• Outward link
  – Object that a page links to
• Inward link
  – Pages that link to an object

Adjacency Matrix

• We can also represent the Web as a very large adjacency matrix.
• The eigenvectors of this matrix illustrate the clusteredness of the Web
  – Distribution of in-degree and out-degree
  – Connectedness
  – Some ranking algorithms (HITS) use this measure.
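
As a minimal sketch of the matrix view, the code below builds a 0/1 adjacency matrix from a link map. The example graph and the function name are assumptions made for illustration. Row sums give outdegree and column sums give indegree.

```python
def adjacency_matrix(links):
    """links: {page: list of pages it links to}; every target must also be a key.

    Returns (pages, matrix) where matrix[i][j] == 1 if pages[i] links to pages[j].
    """
    pages = sorted(links)
    pos = {page: i for i, page in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in pages]
    for src, targets in links.items():
        for dst in targets:
            matrix[pos[src]][pos[dst]] = 1
    return pages, matrix

# Hypothetical four-page web.
pages, m = adjacency_matrix({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]})
out_deg = [sum(row) for row in m]        # row sums
in_deg = [sum(col) for col in zip(*m)]   # column sums
```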

Web structure

• Web can be broken into four areas (Kleinberg/Lawrence)
  – Core: Path between any two pages
  – Upstream: Can reach the core, but no path from core.
  – Downstream: Can be reached from core, but cannot reach core.
  – Tendrils/islands: disconnected from the core.
• Areas (allegedly) have roughly equal size.
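
One way to sketch this four-way split is with forward and backward reachability from a page known to be in the core. The function names, the choice of a single seed page, and the toy graph in the test are assumptions, not the method used in the cited work.

```python
def reachable(links, start):
    """All pages reachable from start by following links (including start)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in links.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def classify(links, core_page):
    """Label every page relative to the core that contains core_page."""
    reverse = {page: [] for page in links}
    for src, targets in links.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    forward = reachable(links, core_page)     # pages the core can reach
    backward = reachable(reverse, core_page)  # pages that can reach the core
    regions = {}
    for page in set(links) | set(reverse):
        if page in forward and page in backward:
            regions[page] = "core"
        elif page in backward:
            regions[page] = "upstream"
        elif page in forward:
            regions[page] = "downstream"
        else:
            regions[page] = "tendril/island"
    return regions
```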

Coverage

• Search engines claim they index a large fraction of the web.
• How to verify this?
  1. Run queries on many engines and compare number of hits.
    • May return irrelevant documents
    • Documents may no longer exist
    • Documents may have changed

Coverage

• NEC (1998)
  – Estimate size of web, coverage for major search engines.
  – Query each engine, retrieve and compare all results (only exact matches).
• Coverage estimates:
  – HotBot: 57%, AltaVista: 46%
  – NorthernLight: 33%, Excite: 23%
  – Infoseek: 16%, Lycos: 4%

Estimating the size of the indexable web

• Overlap in coverage was used to estimate size.

[Figure: two overlapping sets A and B (the pages indexed by two engines); U is their overlap.]

U/B serves as an estimate of A/N, where N is the size of the Web.
1998: AltaVista/HotBot estimate: 320 million pages.
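
Rearranging U/B ≈ A/N gives N ≈ A·B/U. The sketch below applies that rearrangement; the function name and the numbers in the usage line are made up for illustration and are not figures from the study.

```python
def estimate_web_size(size_a, size_b, overlap):
    """Estimate N from two index sizes and their overlap.

    If engine B is a random sample of the Web, the fraction of B also found
    in A (overlap / size_b) approximates A's coverage (size_a / N).
    """
    if overlap <= 0:
        raise ValueError("overlap must be positive to estimate N")
    return size_a * size_b / overlap

# Hypothetical counts: A indexes 100 pages, B indexes 80, 40 are shared.
n = estimate_web_size(100, 80, 40)
coverage_a = 100 / n
```

With these toy numbers half of B is also in A, so A is estimated to cover half the Web.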

Using size to refine coverage estimates (1997)

• This value can then be used to determine a coverage estimate for each engine.
• For each pair, solve for N.
• Assume real N is largest found.
• Updated:
  – HotBot: 34%, AltaVista: 28%
  – NorthernLight: 20%, Excite: 14%
  – Infoseek: 10%, Lycos: 3%
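
A minimal sketch of this refinement, assuming each pairwise estimate is computed as N ≈ A·B/U (index sizes A and B, overlap U). The engine names and counts in the usage lines are hypothetical.

```python
from itertools import combinations

def refine_coverage(sizes, overlaps):
    """sizes: {engine: pages indexed}; overlaps: {frozenset({e1, e2}): shared pages}.

    Solves for N on every pair, takes the largest N, and returns
    (N, {engine: coverage fraction}).
    """
    estimates = [
        sizes[a] * sizes[b] / overlaps[frozenset((a, b))]
        for a, b in combinations(sizes, 2)
        if overlaps.get(frozenset((a, b)), 0) > 0
    ]
    n = max(estimates)
    return n, {engine: size / n for engine, size in sizes.items()}

# Hypothetical engines X, Y, Z.
sizes = {"X": 100, "Y": 80, "Z": 50}
overlaps = {frozenset("XY"): 40, frozenset("XZ"): 20, frozenset("YZ"): 10}
n, coverage = refine_coverage(sizes, overlaps)
```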

Updates: (1999)

• Web growth ahead of indexing
  – No search engine covers more than 16% of the Web.
  – Union of all engines: ~50% coverage
  – Estimated size: 800 million pages
  – Search engines more likely to link to authorities
  – More likely to link to US, commercial sites.

Updates (12/2001)

• Self-reported number of pages indexed:
  – Google: 2 billion (3 billion+ today)
  – FAST (AllTheWeb.com): 625 million (claimed 2.1 billion in 2002)
  – AltaVista: 550 million
  – Inktomi: 500 million
  – NorthernLight: 390 million

Indexing the web

• Spiders are used to crawl the web and collect pages.
  – A page is downloaded and its outward links are found.
  – Each outward link is then downloaded.
  – Exceptions:
    • Links from CGI interfaces
    • Robot Exclusion Standard
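
The crawl loop can be sketched as a breadth-first traversal. To stay runnable offline, this example crawls a hypothetical in-memory link map (`WEB`) instead of downloading pages, and it feeds made-up robots.txt rules to Python's standard `urllib.robotparser`. The URLs, the rules, and the `crawl` function are all assumptions for illustration.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical web: URL -> outward links found on that page.
WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/private/x": ["http://example.com/secret"],
}

def crawl(seed, fetch_links, robots):
    """Visit pages breadth-first, skipping URLs that the robots rules disallow."""
    seen, queue, collected = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # excluded by the Robot Exclusion Standard
        collected.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])  # made-up rules
pages = crawl("http://example.com/", lambda url: WEB.get(url, []), robots)
```

Because the excluded page is never downloaded, the link it contains is never discovered.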

Indexing the Web

• “Stop words” stripped from page
• Forward index created
  – Bundles words
  – Maps each document to the words it contains.
• Can use TFIDF to only map “significant” keywords
  – Term Frequency × Inverse Document Frequency
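
A minimal TF-IDF sketch follows. It assumes one common weighting (term frequency as a share of the document's words, times the natural log of N over document frequency); the slides do not say which variant is meant. The stop-word list is a tiny illustrative sample.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "is"}  # illustrative sample only

def tokenize(text):
    """Lowercase, split on whitespace, and strip stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """docs: {doc_id: text}. Returns {doc_id: {word: score}}."""
    tokens = {doc: tokenize(text) for doc, text in docs.items()}
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for words in tokens.values() for w in set(words))
    n = len(docs)
    scores = {}
    for doc, words in tokens.items():
        tf = Counter(words)
        scores[doc] = {
            w: (count / len(words)) * math.log(n / df[w])
            for w, count in tf.items()
        }
    return scores

scores = tf_idf({"d1": "the web is a graph", "d2": "the web of pages"})
```

A word that appears in every document ("web" here) scores zero, so only the distinguishing words would be kept as keywords.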

Indexing the web

• An inverted index is created
  – Forward index sorted according to word
  – Maps keywords to URLs
• Some wrinkles:
  – Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding
  – Semantic similarity
    • Words with similar meanings share an index.
• Issue: trading coverage (number of hits) for precision (how closely hits match request)
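
The forward-to-inverted step can be sketched as below. Only case folding is applied; stemming and semantic grouping are left out. The URLs and function names are hypothetical.

```python
from collections import defaultdict

def forward_index(pages):
    """pages: {url: text} -> {url: set of case-folded words on that page}."""
    return {url: set(text.casefold().split()) for url, text in pages.items()}

def invert(forward):
    """Turn {url: words} into {word: set of urls containing it}."""
    inverted = defaultdict(set)
    for url, words in forward.items():
        for word in words:
            inverted[word].add(url)
    return dict(inverted)

index = invert(forward_index({
    "http://example.com/1": "Web search engines",
    "http://example.com/2": "web graph",
}))
```

Case folding already trades precision for coverage: "Web" and "web" now share one entry.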

Indexing Issues

• Indexing techniques were designed for static collections
• How to deal with pages that change?
  – Periodic crawls, rebuild index.
  – Varied frequency crawls
    • Records need a way to be “purged”
    • Hash of page stored
• Can use the text of a link to a page to help label that page.
  – Helps eliminate the addition of spurious keywords.
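
Storing a hash of each page lets a later crawl detect changes cheaply. The sketch below uses SHA-256 from Python's `hashlib`. The choice of hash function and the dictionary used as the hash store are assumptions.

```python
import hashlib

def page_hash(content):
    """Hex digest of the page text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def needs_reindex(url, content, stored_hashes):
    """Return True if the page is new or changed; remember its current hash."""
    digest = page_hash(content)
    changed = stored_hashes.get(url) != digest
    stored_hashes[url] = digest
    return changed

stored = {}
first = needs_reindex("http://example.com/", "hello", stored)     # new page
second = needs_reindex("http://example.com/", "hello", stored)    # unchanged
third = needs_reindex("http://example.com/", "hello!", stored)    # edited
```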

Indexing Issues

• Availability and speed
  – Most search engines will cache the page being referenced.
• Multiple search terms
  – OR: separate searches concatenated
  – AND: intersection of searches computed.
  – Regular expressions not typically handled.
• Parsing
  – Must be able to handle malformed HTML, partial documents
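
Multi-term queries reduce to set operations on an inverted index, sketched here with a hypothetical index that maps each keyword to a set of page ids. OR is implemented as a set union, which also removes duplicates that plain concatenation would keep.

```python
def search_or(index, terms):
    """Pages containing at least one of the terms."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

def search_and(index, terms):
    """Pages containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical inverted index.
INDEX = {"web": {1, 2, 3}, "search": {2, 3}, "radio": {4}}
```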

PageRank

• Google uses PageRank to determine relevance.
• Based on the “quality” of a page’s inward links.
• Sum the PageRanks of the pages that point to a given page, each divided by that page’s outdegree.
• Let p be a page, with pages T_1, ..., T_n linking to p.
• $PR(p) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{\mathrm{out}(T_i)}$
• d is a ‘damping’ factor.
• PR ‘propagates’ through a graph.
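
The formula can be evaluated by repeated substitution until the values settle. The sketch below is a simplified illustration, not Google's implementation: d = 0.85 is a commonly quoted value assumed here, the iteration count is arbitrary, and every page is assumed to have at least one outward link.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: list of pages it links to}.

    Assumes every link target is also a key and every page has outdegree > 0.
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        updated = {}
        for page in links:
            # Sum PR(T) / out(T) over every page T that links to this page.
            incoming = sum(
                pr[src] / len(links[src]) for src in links if page in links[src]
            )
            updated[page] = (1 - d) + d * incoming
        pr = updated
    return pr

# Hypothetical three-page web.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph c ends up ranked above b: both receive half of a's rank, but c also receives all of b's.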

PageRank

• Justification:
  – Imagine a random surfer who keeps clicking through links.
    • With probability d she follows a link; with probability 1 - d she starts a new search.
  – Or …
  – A page has a high ranking if highly ranked pages point to it.
  – Pros: difficult to game the system
  – Cons: Creates a “rich get richer” web structure where highly popular sites grow in popularity.

HITS

• HITS is also commonly used for document ranking.
• Gives each page a hub score and an authority score
  – A good authority is pointed to by many good hubs.
  – A good hub points to many good authorities.
  – Users want good authorities.
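
The mutual definition of hubs and authorities suggests an iterative computation, sketched below. The normalization step, the iteration count, and the toy graph in the usage lines are assumptions made for illustration.

```python
import math

def hits(links, iterations=50):
    """links: {page: list of pages it links to}; every target must be a key.

    Returns (hub_scores, authority_scores).
    """
    hub = {page: 1.0 for page in links}
    auth = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing to it.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
        # Hub: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links[p]) for p in links}
        # Rescale so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: two list pages (h1, h2) pointing at two content pages.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []})
```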

Issues with Ranking Algorithms

• Spurious keywords and META tags
• Users reinforcing each other
  – Increases “authority” measure
• Topic drift
  – Many hubs link to more than one topic

Web structure

• Structure is important for:
  – Predicting traffic patterns
    • Who will visit a site?
    • Where will visitors arrive from?
    • How many visitors can you expect?
  – Estimating coverage
    • Is a site likely to be indexed?

Core

• Compact
  – Short paths between sites
  – “Small world” phenomenon
    • Distances are small relative to average path length
  – Number of inward and outward links follows a power law.
• Mechanism: preferential attachment
  – As new sites arrive, the probability of gaining an inward link is proportional to in-degree.
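
Preferential attachment can be sketched with a small simulation. Two simplifications are assumed here: each new site adds exactly one link, and the weight is in-degree + 1 so that sites with no inward links can still be chosen. The function name and seed are hypothetical.

```python
import random

def grow(n_sites, seed=0):
    """Return the in-degree of each site after n_sites have arrived."""
    rng = random.Random(seed)
    indegree = [0]  # start with a single site
    for _ in range(1, n_sites):
        # Assumption: weight = in-degree + 1, so new sites are not locked out.
        weights = [k + 1 for k in indegree]
        target = rng.choices(range(len(indegree)), weights=weights)[0]
        indegree[target] += 1  # the new site links to the chosen site
        indegree.append(0)     # the new site itself starts with no inward links
    return indegree

degrees = grow(200)
```

Sorting `degrees` typically shows a few sites holding many of the links while most hold none, the "rich get richer" pattern.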

Power laws and small worlds

• Power laws occur everywhere in nature
  – Distribution of site sizes, city sizes, incomes, word frequencies
  – Random networks tend to evolve according to a power law.
• Small-world phenomenon
  – “Neighborhoods” will be joined by a common member
  – Hubs serve to connect neighborhoods
  – Linkage is closer than one might expect
  – Six Degrees of Separation, Kevin Bacon

Local structure

• More diverse than a power law
• Pages with similar topics self-organize into communities
  – Short average path length
  – High link density
  – Webrings
  – Inverse: Does a high link density imply the existence of a community?
  – Can this be used to study the emergence and growth of web communities?

Hubs and Authorities

• Common community structure
  – Hubs
    • Many outward links
    • Lists of resources
  – Authorities
    • Many inward links
    • Provide resources, content

Hubs and Authorities

[Figure: hubs linking to authorities]

Link structure estimates over 100,000 Web communities
Often not categorized by portals

Web Communities

• Alternate definition
  – Each member has more links to community members than non-community members.
  – Extension of a clique.
  – Can be discovered with network flow algorithms.
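
The definition can be checked directly for a candidate set of pages. This sketch only verifies the condition; it does not perform the network-flow discovery mentioned above. Links are treated as undirected, and the example graph is hypothetical.

```python
def is_community(neighbors, members):
    """neighbors: {page: set of pages it shares a link with} (undirected).

    True if every member has more links inside the set than outside it.
    """
    members = set(members)
    for page in members:
        linked = neighbors.get(page, set())
        inside = len(linked & members)
        outside = len(linked - members)
        if inside <= outside:
            return False
    return True

# Hypothetical graph: a triangle a-b-c, with c also linked to an outsider x.
NEIGHBORS = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "x"},
    "x": {"c", "y"},
    "y": {"x"},
}
```

The triangle {a, b, c} passes the check even though c has one outside link, while {c, x} fails because c has more links outside that pair than inside it.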

Weaknesses of search engines