Web characteristics, broadly defined
Thanks to C. Manning, P. Raghavan, H. Schütze
What we have covered
  What is IR
  Evaluation
  Tokenization and properties of text
  Web crawling
  Vector methods
  Measures of similarity
  Indexing
  Inverted files
Today:
  Web characteristics
  Web vs. classic IR
  Web advertising
  Web as a graph – size
  SEO
  Web spam
  Dup detection
[Diagram: A Typical Web Search Engine – a Crawler fetches pages from the Web and feeds the Indexer, which builds the Index; Users pose queries through the Interface to the Query Engine, which consults the Index]
Characteristics
What Is the World Wide Web?
The world wide web (web) is a network of information resources, most of which is published by humans.
  – The world's largest publishing mechanism
The web relies on three mechanisms to make these resources readily available to the widest possible audience:
  – A uniform naming scheme for locating resources on the web (e.g., URIs). • Find it
  – Protocols for access to named resources over the web (e.g., HTTP). • How to get there
  – Hypertext for easy navigation among resources (e.g., HTML). • Where to go from there
The web is an opt-out, not an opt-in, system:
  – Your information is available to all; you have to protect it.
  – DMCA (Digital Millennium Copyright Act) applies only to those who sign it.
Internet vs. Web
Internet:
  • A more general term
  • Includes the physical aspect of underlying networks and mechanisms such as email, FTP, HTTP, ...
Web:
  • HTTP
  • Associated with information stored on the Internet
  • "Web" also refers to a broader class of networks, e.g., the web of English literature
Both the Internet and the web are networks.
Networks vs. Graphs
Examples?
Old internet network
Internet Technologies: Web Standards
Internet Engineering Task Force (IETF): http://www.ietf.org/
  Founded 1986
  Request For Comments (RFC) at http://www.ietf.org/rfc.html
World Wide Web Consortium (W3C): http://www.w3.org
  Founded 1994 by Tim Berners-Lee
  Publishes technical reports and recommendations
Internet Technologies: Web Design Principles
Interoperability: Web languages and protocols must be compatible with one another independent of hardware and software.
Evolution: The Web must be able to accommodate future technologies.
Encourages simplicity, modularity and extensibility.
Decentralization: Facilitates scalability and robustness.
Languages of the WWW
Markup languages A markup language combines text and extra
information about the text. The extra information, for example about the text's structure or presentation, is expressed using markup, which is intermingled with the primary text.
The best-known markup language in modern use is HTML (Hypertext Markup Language), one of the foundations of the World Wide Web. Historically, markup was (and is) used in the publishing industry in the communication of printed work between authors, editors, and printers.
Without search engines the web wouldn't scale (there would be no web)
1. No incentive to create content unless it can be easily found – other finding methods haven't kept pace (taxonomies, bookmarks, etc.)
2. The web is both a technology artifact and a social environment: "The Web has become the 'new normal' in the American way of life; those who don't go online constitute an ever-shrinking minority." – [Pew Foundation report, January 2005]
3. Search engines make aggregation of interest possible: they create incentives for very specialized niche players
   Economic – specialized stores, providers, etc.
   Social – narrow interests, specialized communities, etc.
4. The acceptance of search interaction makes "unlimited selection" stores possible: Amazon, Netflix, etc.
5. Search turned out to be the best mechanism for advertising on the web, a $15+ B industry (2011). Growing very fast, but the entire US advertising industry is $250B – huge room to grow. Sponsored search marketing is about $10B.
Classical IR vs. Web IR
Basic assumptions of Classical Information Retrieval
Corpus: Fixed document collection
Goal: Retrieve documents with information content that is relevant to the user's information need
Searcher: An information scientist or a search professional trained in making logical queries
Classic IR Goal
Classic relevance: for each query Q and stored document D in a given corpus, assume there exists a relevance Score(Q, D)
Score is averaged over users U and contexts C
Optimize Score(Q, D) as opposed to Score(Q, D, U, C). That is, usually:
  Context ignored
  Individuals ignored
  Corpus predetermined
These are bad assumptions in the web context.
Web Information Retrieval
Basic assumptions of Web Information Retrieval
Corpus: constantly changing; created by amateurs and professionals
Goal: Retrieve summaries of relevant information quickly, with links to the original site. High precision! Recall not important
Searcher: amateurs; no professional training and little or no concern about query quality
The coarse-level dynamics
[Diagram: content creators feed content aggregators via feeds and crawls; content consumers draw on the aggregators; flows labeled Advertisement, Editorial, Subscription, Transaction]
Brief (non-technical) history
Early keyword-based engines: Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997
Paid placement ranking: Goto.com (morphed into Overture.com, then Yahoo!)
  Your search ranking depended on how much you paid
  Auction for keywords: casino was expensive!
Brief (non-technical) history
1998+: Link-based ranking pioneered by Google
  Blew away all early engines save Inktomi
  Great user experience in search of a business model
Meanwhile Goto/Overture's annual revenues were nearing $1 billion
Result: Google added paid-placement "ads" to the side, independent of search results
Yahoo follows suit, acquiring Overture (for paid placement) and Inktomi (for search)
[Screenshots: query results pages from 2009, 2010, 2013, and 2014, with the ads and algorithmic results labeled]
Query String for Google 2013
http://www.google.com/search?q=lee+giles
"?" in the URL marks the start of the query component; the query equals lee AND giles
The basic query is reduced from the full one: the query above is lee giles
What do Bing and DuckDuckGo do?
• For more, see the Wikipedia article on query strings
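As a concrete illustration (standard URL handling, not anything Google-specific), a minimal Python sketch that builds and parses such a query string with urllib:

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Build a Google-style search URL; everything after '?' is the query string.
url = "http://www.google.com/search?" + urlencode({"q": "lee giles"})
print(url)                  # http://www.google.com/search?q=lee+giles

# Parse it back: the 'q' parameter carries the user's query terms.
params = parse_qs(urlparse(url).query)
print(params["q"])          # ['lee giles']
```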
Ads vs. search results
Google has maintained that ads (based on vendors bidding for keywords) do not affect vendors’ rankings in search results
[Screenshot: Google results for the query miele – "Web Results 1 - 10 of about 7,310,000 for miele (0.12 seconds)"; algorithmic results for miele.com, miele.co.uk, miele.de, miele.at on the left; Sponsored Links (CG Appliance Express, www.vacuums.com, www.best-vacuum.com) on the right]
Ads vs. search results
Other vendors (Yahoo, Bing) have made similar statements from time to time; any of them can change anytime
We focus primarily on search results, independent of paid placement ads, although the latter is a fascinating technical subject in itself
Pay Per Click (PPC) Search Engine Ranking
PPC ads appear as "sponsored listings"
Companies bid on the price they are willing to pay "per click"
Typically have very good tracking tools and statistics
Ability to control ad text
Can set budgets and spending limits
Google AdWords and Overture are the two leaders
Web search basics
[Diagram: web search basics – a web spider crawls the Web, the indexer builds indexes used to answer user searches, and separate ad indexes supply the sponsored links shown alongside the algorithmic results (miele example repeated)]
User Needs [Brod02, RL04]
Informational – want to learn about something (~40% / 65%)
Navigational – want to go to that page (~25% / 15%)
Transactional – want to do something (web-mediated) (~35% / 20%)
  Access a service
  Downloads
  Shop
Gray areas
  Find a good hub
  Exploratory search "see what's there"
Example queries: low hemoglobin, United Airlines, Seattle weather, Mars surface images, Canon S410, car rental Brasil
Web search users
Make ill-defined queries
  Short: AV 2001: 2.54 terms avg, 80% < 3 words; AV 1998: 2.35 terms avg, 88% < 3 words [Silv98]
  Imprecise terms
  Sub-optimal syntax (most queries without operators)
  Low effort
Wide variance in needs, expectations, knowledge, bandwidth
Specific behavior
  85% look over one result screen only (mostly above the fold)
  78% of queries are not modified (one query/session)
  Follow links – "the scent of information" ...
Query Distribution
Power law: few popular broad queries, many rare specific queries
How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users' empirical evaluation of results
Quality of pages varies widely
  Relevance is not enough
  Other desirable qualities (non-IR!!)
    Content: trustworthy, new info, non-duplicates, well maintained
    Web readability: displays correctly & fast
    No annoyances: pop-ups, etc.
Precision vs. recall
  On the web, recall seldom matters
  What matters: precision at 1? Precision above the fold?
  Comprehensiveness – must be able to deal with obscure queries
  Recall matters when the number of matches is very small
User perceptions may be unscientific, but are significant over a large aggregate
Users' empirical evaluation of engines
Relevance and validity of results
UI – simple, no clutter, error tolerant
Trust – results are objective
Coverage of topics for polysemic queries, e.g., mole, bank
Pre/post-process tools provided
  Mitigate user errors (auto spell check, syntax errors, ...)
  Explicit: search within results, more like this, refine ...
  Anticipative: related searches
Deal with idiosyncrasies
  Web-specific vocabulary: impact on stemming, spell-check, etc.
  Web addresses typed in the search box ...
The Web corpus
Large dynamic directed graph
No design/coordination
Distributed content creation, linking, democratization of publishing
Content includes truth, lies, obsolete information, contradictions ...
Unstructured (text, html, ...), semi-structured (XML, annotated photos), structured (databases) ...
Scale much larger than previous text corpora ... but corporate records are catching up
Growth – slowed down from initial "volume doubling every few months" but still expanding
Content can be dynamically generated
Web Today
The Web consists of hundreds of billions of pages (Google query 'the')
  Potentially infinite if dynamic pages are considered
It is considered one of the biggest information revolutions in recent human history
One of the largest graphs around
Full of information
Trends
Web page
The simple Web graph
A graph G = (V, E) is defined by
  a set V of vertices (nodes)
  a set E of edges (links) = pairs of nodes
The Web page graph (directed)
  V is the set of static public pages
  E is the set of static hyperlinks
Many more graphs can be defined
  the host graph, the co-citation graph, the temporal graph, etc.
Which pages do we care about if we want to measure the web graph?
Avoid "dynamic" pages:
  catalogs
  pages generated by queries
  pages generated by cgi-scripts (the Nostradamus effect)
Only interested in "static" web pages
The Web: Dynamic content
A page without a static html version
  E.g., the current status of flight AA129, or current availability of rooms at a hotel
Usually assembled at the time of a request from a browser
Sometimes the URL has a "?" character in it; "?" precedes the actual query
[Diagram: browser → application server → back-end databases, e.g., for the AA129 query]
The Static Public Web
Static
  not the result of a cgi-bin script
  no "?" in the URL
  doesn't change very often
  etc.
Public
  no password required
  no robots.txt exclusion
  no "noindex" meta tag
  etc.
These rules can still be fooled (a spider-trap check is sketched below):
  "Dynamic pages" appear static, e.g., browseable catalogs (hierarchy built from a DB)
  Spider traps – infinite URL descent: www.x.com/home/home/home/…./home/home.html
  Spammer games
Example - http://clgiles.ist.psu.edu
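A minimal sketch of one heuristic a crawler might apply to skip obvious spider traps like the repeated-path URL above; the depth and repetition thresholds are illustrative choices, not anything a real engine is known to use:

```python
from urllib.parse import urlparse

def looks_like_spider_trap(url, max_depth=12, max_repeats=3):
    """Heuristic: flag very deep paths or the same path segment repeated many times."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # e.g. www.x.com/home/home/home/.../home.html repeats 'home' over and over
    return any(segments.count(seg) > max_repeats for seg in set(segments))

print(looks_like_spider_trap("http://www.x.com/home/home/home/home/home/home.html"))  # True
print(looks_like_spider_trap("http://clgiles.ist.psu.edu/"))                          # False
```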
Why do we care about the Web graph?
Is it the largest human artifact ever created?
Exploit the Web structure for
  crawlers
  search and link-analysis ranking
  spam detection
  community discovery
  classification/organization
  business, politics, society applications
Predict the Web future
  mathematical models
  algorithm analysis
  sociological understanding
New business opportunities
New politics
The first question: what is the size of the Web?
Surprisingly hard to answer
Naïve solution: keep crawling until the whole graph has been explored
Extremely simple but wrong: crawling is complicated because the web is complicated
  spamming, duplicates, mirrors
Simple example of a complication: soft 404s
  When a page does not exist, the server is supposed to return an error code ("404")
  Many servers do not return an error code, but keep the visitor on the site, or simply send him to the home page
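One way to probe for soft 404s, sketched below under the assumption that a site will almost certainly not have a page at a freshly invented random path: request that bogus URL and see whether the server still answers 200. Real systems also compare the returned content against the home page; the function and probe path here are illustrative only.

```python
import uuid
import requests
from urllib.parse import urljoin

def serves_soft_404(base_url, timeout=10):
    """Request a URL that should not exist; a 200 response suggests soft 404s."""
    bogus = "this-page-should-not-exist-" + uuid.uuid4().hex
    resp = requests.get(urljoin(base_url, bogus), timeout=timeout, allow_redirects=True)
    return resp.status_code == 200   # a well-behaved server would return 404

# Hypothetical usage:
# print(serves_soft_404("http://www.example.com/"))
```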
A sampling approach
Sample pages uniformly at random
Compute the percentage of the pages that belong to a search engine repository (search engine coverage)
Estimate the size of the Web
Problems:
  how do you sample a page uniformly at random?
  how do you test if a page is indexed by a search engine?
Sampling pages [LG99, HHMN00]
  Create IP addresses uniformly at random – problems with virtual hosting, spamming
  Starting from a subset of pages, perform a random walk on the graph. After "enough" steps you should end up at a random page – near-uniform sampling (see the sketch below)
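A toy sketch of the random-walk idea on an in-memory adjacency list (not a crawler): walk for a burn-in period and take the node you end on as an approximately random page. Note that a plain walk is biased toward high-degree pages; estimators such as [HHMN00] correct for that bias, which this sketch does not.

```python
import random

def random_walk_sample(graph, start, steps=10_000):
    """graph: dict page -> list of out-linked pages (toy adjacency list).
    Walk `steps` hops, jumping back to `start` at dead ends, and return the
    final node as an approximately random sample."""
    node = start
    for _ in range(steps):
        out_links = graph.get(node, [])
        node = random.choice(out_links) if out_links else start
    return node

# Tiny made-up web graph:
web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": ["a"]}
print(random_walk_sample(web, "a"))
```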
Testing search engine containment [BB98]
Measuring the Web
It is clear that the Web we see is what the crawler discovers
We need large crawls in order to make meaningful measurements
The measurements are still biased by
  the crawling policy
  size limitations of the crawl
  perturbations of the "natural" process of birth and death of nodes and links
Measures on the Web graph [BKMRRSTW00]
Degree distributions
The global picture: what does the Web look like from afar?
  Reachability
  Connected components
Community structure: the finer picture
In-degree distribution
Power-law distribution with exponent 2.1
Out-degree distribution
Power-law distribution with exponent 2.7
The good news
The fact that the exponent is greater than 2 implies that the expected value of the degree is a constant (not growing with n)
Therefore, the expected number of edges is linear in the number of nodes n
This is good news, since we cannot handle anything much more than linear
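To see why the exponent matters, approximate the mean of a power-law degree distribution by an integral:

```latex
P(k) \propto k^{-\gamma}, \qquad
\mathbb{E}[k] \;=\; \sum_{k \ge 1} k\,P(k)
\;\approx\; C \int_{1}^{\infty} k^{1-\gamma}\,dk
\;=\; \frac{C}{\gamma - 2},
```

which is finite exactly when γ > 2 (the in-degree exponent 2.1 and the out-degree exponent 2.7 both qualify). The second moment needs γ > 3, so the variance still diverges even though the mean does not.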
Connected components – definitions
Weakly connected components (WCC)
  Set of nodes such that from any node you can reach any other node via an undirected path
Strongly connected components (SCC)
  Set of nodes such that from any node you can reach any other node via a directed path
A sketch for computing both is given below.
[Figure: an example graph with its SCCs and WCCs highlighted]
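A quick sketch of computing both kinds of components with the networkx library; the tiny made-up graph stands in for a crawl:

```python
import networkx as nx

# Toy directed "web graph": nodes are pages, edges are hyperlinks.
G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"),   # a small SCC {a, b, c}
                ("c", "d"), ("e", "d")])              # d and e hang off it

# SCC: mutual reachability along directed paths.
sccs = list(nx.strongly_connected_components(G))
# WCC: reachability if edge directions are ignored.
wccs = list(nx.weakly_connected_components(G))

print(sccs)   # e.g. [{'a', 'b', 'c'}, {'d'}, {'e'}]
print(wccs)   # [{'a', 'b', 'c', 'd', 'e'}]
```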
The bow-tie structure of the Web
SCC and WCC distribution
The SCC and WCC sizes follow a power-law distribution; the second-largest SCC is significantly smaller
The inner structure of the bow-tie [LMST05]
What do the individual components of the bow tie look like? They obey the same power laws in the degree
distributions
The inner structure of the bow-tie
Is it the case that the bow-tie repeats itself in each of the components (self-similarity)?
It would look nice, but this does not seem to be the case: no large WCC, many small ones
The daisy structure?
Large connected core, and highly fragmented IN and OUT components
Unfortunately, we do not have a large crawl to verify this hypothesis
A different kind of self-similarity [DKCRST01]
Consider Thematically Unified Clusters (TUCs): pages grouped by
  keyword searches
  web location (intranets)
  geography
  hostgraph
  random collections
All such TUCs exhibit a bow-tie structure!
Self-similarity
The Web consists of a collection of self-similar structures that form a backbone of the SCC
Dynamic content
Most dynamic content is ignored by web spiders
  Many reasons, including malicious spider traps
Some dynamic content (e.g., news stories from subscriptions) is sometimes still delivered
Application-specific spidering
Spiders commonly view web pages just as Lynx (a text browser) would
Note: even "static" pages are typically assembled on the fly (e.g., headers are common)
The web: size
What is being measured?
  Number of hosts
  Number of (static) html pages
  Volume of data
Number of hosts – Netcraft survey: http://news.netcraft.com/archives/web_server_survey.html
  Monthly report on how many web hosts & servers are out there
Number of pages – numerous estimates
[Charts: Netcraft Web Server Survey, http://news.netcraft.com/archives/web_server_survey.html]
The web: evolution
All of these numbers keep changing
Relatively few scientific studies of the evolution of the web [Fetterly et al., 2003]
  http://research.microsoft.com/research/sv/sv-pubs/p97-fetterly/p97-fetterly.pdf
Sometimes possible to extrapolate from small samples (fractal models) [Dill et al., 2001]
  http://www.vldb.org/conf/2001/P069.pdf
Rate of change
[Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999
  Any changes: 40% weekly, 23% daily
[Fett02] Massive study: 151M pages checked over a few months
  Significant changes – 7% weekly
  Small changes – 25% weekly
[Ntul04] 154 large sites re-crawled from scratch weekly
  8% new pages/week, 8% die, 5% new content, 25% new links/week
Static pages: rate of change
Fetterly et al. study (2002): several views of data, 150 million pages over 11 weekly crawls
Bucketed into 85 groups by extent of change
Other characteristics
Significant duplication
  Syntactic – 30%-40% (near-)duplicates [Brod97, Shiv99b, etc.]
  Semantic – ???
High linkage
  More than 8 links/page on average
Complex graph topology
  Not a small world; bow-tie structure [Brod00]
Spam
  Billions of pages
Spam vs. Search Engine Optimization (SEO)
SERPs
• Search engine results pages (SERPs) are very influential
• How can they be manipulated?
• How can you come up first?
[Screenshot: Google SERP for the query "lee giles"]
The trouble with paid placement…
It costs money. What's the alternative?
Search Engine Optimization (SEO): "tuning" your web page to rank highly in the search results for selected keywords
An alternative to paying for placement; thus, intrinsically a marketing function
Performed by companies, webmasters and consultants ("search engine optimizers") for their clients
Some perfectly legitimate, some very shady; some frowned upon by search engines
Simplest forms
First generation engines relied heavily on tf-idf
  The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
SEOs responded with dense repetitions of chosen terms
  e.g., maui resort maui resort maui resort
  Often, the repetitions would be in the same color as the background of the web page
    Repeated terms got indexed by crawlers but were not visible to humans in browsers
Pure word density cannot be trusted as an IR signal
Search Engine Spam: Objective
Success of commercial Web sites depends on the number of visitors who find the site while searching for a particular product. 85% of searchers look at only the first page of results. A new business sector – search engine optimization.
M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. International Joint Conference on Artificial Intelligence, 2003.
Drost, I. and Scheffer, T. Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. 16th European Conference on Machine Learning, Porto, 2005.
What’s SEO? SEO = Search Engine Optimization
Refers to the process of “optimizing” both the on-page and off-page ranking factors in order to achieve high search engine rankings for targeted search terms.
Refers to the “industry” that revolves around obtaining high rankings in the search engines for desirable keyword search terms as a means of increasing the relevant traffic to a given website.
Refers to an individual or company that optimizes websites for its clientele.
Has a number of related meanings, and usually refers to an individual/firm that focuses on optimizing for “organic” search engine rankings
What’s SEO based on
Features we know are used in web page ranking:
  Page content
  Page metadata
  Anchor text
  Links
  User behavior
  Others?
Search Engine Spam: Spamdexing
spamdexing (also known as search spam, search engine spam, web spam or search engine poisoning) is the deliberate manipulation of search engine indexes
aka: black hat SEO
Porn led the way – ‘96
Search Engine Spamdexing Methods
Content based
Link spam
Cloaking
Mirror sites
URL redirection
Content spamming
Keyword stuffing
  • Calculated placement of keywords within a page to raise the keyword count, variety, and density of the page
  • Truncation so that massive dictionary lists cannot be indexed on a single webpage
  • Hidden or invisible text
Meta-tag stuffing
  • Out of date
Doorway pages
  • "Gateway" or doorway pages created with very little content, instead stuffed with very similar keywords and phrases
  • BMW caught
Scraper sites
  • Amalgamation of content taken from other sites; still works
Article spinning
  • Rewriting existing articles, as opposed to merely scraping content from other sites; undertaken by hired writers or automated using a thesaurus database or a neural network
Variants of keyword stuffing
Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks, etc.
Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
Link Spamming: Techniques
Link farms: Densely connected arrays of pages. Farm pages propagate their PageRank to the target, e.g., by a funnel-shaped architecture that points directly or indirectly towards the target page. To camouflage link farms, tools fill in inconspicuous content, e.g., by copying news bulletins.
Link exchange services: Listings of (often unrelated) hyperlinks. To be listed, businesses have to provide a back link that enhances the PageRank of the exchange service.
Guestbooks, discussion boards, and weblogs: Automatic tools post large numbers of messages to many sites; each message contains a hyperlink to the target website.
Cloaking
Serve fake content to the search engine spider
DNS cloaking: switch IP address; impersonate
[Diagram: "Is this a search engine spider?" If yes, serve the SPAM page; if no, serve the real document – that is cloaking]
Google Bombing != Google Hacking
http://en.wikipedia.org/wiki/Google_bomb
A Google bomb (or Google wash) is an attempt to influence the ranking of a given site in results returned by the Google search engine. Due to the way Google's PageRank algorithm works, a website will be ranked higher if the sites that link to that page all use consistent anchor text.
A Google bomb occurs when a number of people place links on several sites so that a particular keyword search combination leads to a specific site, which is then deluged/swamped with hits (it may even crash the site).
Google Bomb - old
Others?
Query: french military victories
Link Spamming: Defenses
Manual identification of spam pages and farms to create a blacklist.
Automatic classification of pages using machine learning techniques.
BadRank algorithm. The "bad rank" is initialized to a high value for blacklisted pages. It propagates bad rank to all referring pages (with a damping factor) thus penalizing pages that refer to spam.
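A minimal sketch of the BadRank idea just described: seed blacklisted pages with high badness and let it flow backwards along links with a damping factor. Published variants differ in normalization and seeding; the graph, function name, and constants here are illustrative only.

```python
def bad_rank(graph, blacklist, damping=0.85, iters=50):
    """graph: dict page -> list of pages it links to (out-links).
    A page inherits badness from the pages it points to, attenuated by damping."""
    indeg = {p: 0 for p in graph}          # split a bad page's score among its referrers
    for out_links in graph.values():
        for q in out_links:
            indeg[q] = indeg.get(q, 0) + 1

    bad = {p: (1.0 if p in blacklist else 0.0) for p in graph}
    for _ in range(iters):
        new = {}
        for page, out_links in graph.items():
            seed = 1.0 if page in blacklist else 0.0
            new[page] = seed + damping * sum(
                bad.get(q, 0.0) / max(indeg.get(q, 1), 1) for q in out_links)
        bad = new
    return bad

# Toy example: s is blacklisted spam; p links to s; q links to p; r is clean.
g = {"s": [], "p": ["s"], "q": ["p"], "r": []}
scores = bad_rank(g, {"s"})
print(sorted(scores, key=scores.get, reverse=True))   # ['s', 'p', 'q', 'r']
```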
Intelligent SEO
Figure out how search engines do their ranking – an inductive science
Make intelligent changes
Figure out what happens
Repeat
So What is the Search Engine Ranking Algorithm?
Top secret! Only select employees of the actual search engines know for certain
Reverse engineering, research and experiments give some idea of the major factors and approximate weight assignments
Constant changing, tweaking, and updating is done to the algorithm
Websites and documents being searched are also constantly changing
Varies by search engine – some give more weight to on-page factors, some to link popularity
SEO Expert
Search engine optimization vs. spam
Motives
  Commercial, political, religious, lobbies
  Promotion funded by advertising budget
Operators
  Contractors (search engine optimizers) for lobbies, companies
  Web masters
  Hosting services
Forums
  E.g., Web Master World (www.webmasterworld.com)
    Search-engine-specific tricks; discussions about academic papers
The spam industry
SEO contests
•Now part of some search engine classes!
The war against spam
Quality signals – prefer authoritative pages based on:
  Votes from authors (linkage signals)
  Votes from users (usage signals)
Policing of URL submissions
  Anti-robot test
Limits on meta-keywords
Robust link analysis
  Ignore statistically implausible linkage (or text)
  Use link analysis to detect spammers (guilt by association)
Spam recognition by machine learning
  Training set based on known spam
Family-friendly filters
  Linguistic analysis, general classification techniques, etc.
  For images: flesh-tone detectors, source text analysis, etc.
Editorial intervention
  Blacklists
  Top queries audited
  Complaints addressed
  Suspect pattern detection
The war against spam – what Google does
Google changes their algorithms regularly
  Panda and Penguin updates
Uses machine learning and AI methods; SEO doesn't seem to use this
Suing Google doesn't work: Google wins in SearchKing lawsuit
Defines what counts as a high-quality site (23 features)
A Google penalty refers to a negative impact on a website's search ranking based on updates to Google's search algorithms. A penalty can be an unfortunate by-product or an intentional penalization of various black-hat SEO techniques.
Future of spamdexing
Who's the smartest tech-wise: search engines or SEOs? Constant evolution. Can it ever be solved?
Web search engines have policies on the SEO practices they tolerate/block
  http://www.bing.com/toolbox/webmaster
  http://www.google.com/intl/en/webmasters/
Adversarial IR: the unending (technical) battle between SEOs and web search engines
  Research: http://airweb.cse.lehigh.edu/
Many SEO companies will suffer
Reported “Organic” (white hat) Optimization Techniques
Register with search engines
Research keywords related to your business
Identify competitors, utilize benchmarking techniques and identify the level of competition
Utilize descriptive title tags for each page
Ensure that your text is HTML text and not image text
Use text links when possible
Use appropriate keywords in your content and internal hyperlinks (don't overdo it!)
Obtain inbound links from related websites
Use sitemaps
Monitor your search engine rankings and, more importantly, your website traffic statistics and sales/leads produced
Educate yourself about search engine marketing or consult a search engine optimization firm or SEO expert
Use the Google Guide to High Quality Websites
Duplicate detection
Get rid of duplicates; save space and time
Product or idea evolution (near duplicates)
Check for stolen information, plagiarism, etc.
Duplicate documents
The web is full of duplicated content; estimates at 40%
Strict duplicate detection = exact match
  Not as common
But many, many cases of near duplicates
  E.g., the last-modified date is the only difference between two copies of a page
Duplicate/Near-Duplicate Detection
Duplication: exact match can be detected with fingerprints (hash values)
Near-duplication: approximate match
Overview:
  Compute syntactic similarity with an edit-distance measure
  Use a similarity threshold to detect near-duplicates
    E.g., similarity > 80% => documents are "near duplicates"
    Not transitive, though sometimes used transitively
Computing Similarity
Features:
  Segments of a document (natural or artificial breakpoints)
  Shingles (word n-grams), e.g., a rose is a rose is a rose →
    a_rose_is_a
    rose_is_a_rose
    is_a_rose_is
    a_rose_is_a
Similarity measure between two docs (= sets of shingles):
  Set intersection; specifically, Size_of_Intersection / Size_of_Union (the Jaccard coefficient)
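A minimal sketch of word shingling and the Jaccard measure (the shingle size and example documents are arbitrary):

```python
def shingles(text, k=4):
    """The set of word k-grams ('shingles') of a document."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size_of_Intersection / Size_of_Union of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

d1 = shingles("a rose is a rose is a rose")
d2 = shingles("a rose is a rose is a flower")
print(sorted(d1))       # ['a_rose_is_a', 'is_a_rose_is', 'rose_is_a_rose']
print(jaccard(d1, d2))  # 0.75 - near duplicates, not exact duplicates
```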
Shingles + Set Intersection
Computing exact set intersection of shingles between all pairs of documents is expensive/intractable
Approximate using a cleverly chosen subset of shingles from each document (a sketch)
Estimate size_of_intersection / size_of_union based on a short sketch
[Diagram: Doc A → shingle set A → sketch A; Doc B → shingle set B → sketch B; the sketches are compared to estimate the Jaccard coefficient]
Sketch of a document
Create a "sketch vector" (of size ~200) for each document
Documents that share ≥ t (say 80%) of corresponding vector elements are near duplicates
For doc D, sketch_D[i] is computed as follows:
  Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting)
  Let π_i be a random permutation on 0..2^m
  Pick MIN {π_i(f(s))} over all shingles s in D
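A minimal sketch of such a sketch vector, approximating the random permutations π_i with salted hashes (a common substitute; the hash choice and sketch size are illustrative):

```python
import hashlib
import random

def minhash_sketch(shingle_set, num_hashes=200, seed=0):
    """sketch[i] = min over all shingles s of the i-th 'permuted' fingerprint of s."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]

    def h(salt, s):
        # 64-bit fingerprint of the shingle, 'permuted' by the salt
        return int(hashlib.md5((str(salt) + s).encode()).hexdigest()[:16], 16)

    return [min(h(salt, s) for s in shingle_set) for salt in salts]

def estimated_jaccard(sk1, sk2):
    """The fraction of agreeing sketch positions estimates the Jaccard coefficient."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

# With d1, d2 from the shingling sketch above:
# print(estimated_jaccard(minhash_sketch(d1), minhash_sketch(d2)))   # close to 0.75
```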
Comparing Signatures
Signature Matrix S
  Rows = hash functions
  Columns = documents
  Entries = signatures
Can compute pair-wise similarity of any pair of signature columns, or all signature pairs
Now we have an extremely efficient method for estimating the Jaccard coefficient for a single pair of documents.
But we still have to estimate N^2 coefficients, where N is the number of web pages. Still slow.
One solution: locality sensitive hashing (LSH)
Another solution: sorting (Henzinger 2006)
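A minimal sketch of the banding trick behind LSH (the band and row counts are illustrative; they only need to multiply to the sketch length): documents whose sketches agree on any whole band land in the same bucket and become candidate pairs, so only candidates need a full comparison rather than all N^2 pairs.

```python
from collections import defaultdict

def lsh_candidate_pairs(sketches, bands=50, rows=4):
    """sketches: dict doc_id -> minhash sketch of length bands * rows.
    Returns pairs of documents that collide in at least one band."""
    buckets = defaultdict(set)
    for doc_id, sketch in sketches.items():
        for b in range(bands):
            band = tuple(sketch[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)

    candidates = set()
    for docs in buckets.values():
        docs = sorted(docs)
        candidates.update((docs[i], docs[j])
                          for i in range(len(docs)) for j in range(i + 1, len(docs)))
    return candidates

# Usage sketch: sketches = {url: minhash_sketch(shingles(text)) for url, text in pages}
# then compute estimated_jaccard only for the returned candidate pairs.
```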
What we covered
Web vs. classic IR
What users do
Web advertising
Web as a graph
  Applications
  Structure and size
SEO / Web spam
  White hat vs. black hat
Dup detection