C20.0046: Database Management Systems Lecture #27
Transcript of C20.0046: Database Management Systems Lecture #27
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
C20.0046: Database Management Systems, Lecture #27
M.P. Johnson, Stern School of Business, NYU, Spring 2005
Agenda
Last time:
  Data Mining
  RAID
  Websearch
  Etc.
Goals after today:
1. Understand what RAID is
2. Be able to perform RAID 4
3. Understand some issues in websearch
4. Be able to perform PageRank
New topic: Recovery

Type of Crash                     Prevention
Wrong data entry                  Constraints and data cleaning
Disk crashes                      Redundancy: e.g. RAID, archive
Fire, theft, bankruptcy…          Buy insurance, change jobs…
System failures: e.g. blackout    DATABASE RECOVERY
System Failures (skip?)
Each transaction has internal state
When the system crashes, internal state is lost: we don't know which parts executed and which didn't
Remedy: use a log
  A file that records each action of each xact
  Trail of breadcrumbs
See text for details…
Media Failures
Rule of thumb: Pr(hard drive has head crash within 10 years) = 50%
Simpler rule of thumb: Pr(hard drive has head crash within 1 year) = (say) 10%
If we have many drives, then crashes are a regular occurrence
Soln: different RAID strategies
RAID: Redundant Arrays of Independent Disks
RAID levels
RAID level 1: each disk gets a mirror
RAID level 4: one disk is the xor of all the others
  Each bit is the sum mod 2 of the corresponding bits
E.g.:
  Disk 1: 11110000
  Disk 2: 10101010
  Disk 3: 00111000
  Disk 4:
How to recover?
Various other RAID levels in text…
RAID levels
RAID level 1: each disk gets a mirror
RAID level 4: one disk is the xor of all the others
  Each bit is the sum mod 2 of the corresponding bits
E.g., suppose Disk 1 fails:
  Disk 1:
  Disk 2: 10101010
  Disk 3: 00111000
  Disk 4: 01100010
How to recover?
Various other RAID levels in text…
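The RAID 4 arithmetic above can be sketched in a few lines of Python; the disk contents are the slide's example values, and parity/recover are illustrative helper names, not from the lecture:

```python
def parity(disks):
    # XOR corresponding bits of all data disks to produce the parity disk
    result = disks[0][:]
    for d in disks[1:]:
        result = [a ^ b for a, b in zip(result, d)]
    return result

def recover(survivors, parity_disk):
    # A lost data disk is the XOR of the parity disk with all surviving data disks
    return parity(survivors + [parity_disk])

disk1 = [1, 1, 1, 1, 0, 0, 0, 0]   # 11110000
disk2 = [1, 0, 1, 0, 1, 0, 1, 0]   # 10101010
disk3 = [0, 0, 1, 1, 1, 0, 0, 0]   # 00111000
disk4 = parity([disk1, disk2, disk3])        # the parity disk
assert recover([disk2, disk3], disk4) == disk1   # disk 1 lost, then rebuilt
```

Recovery is the same XOR operation as computing parity, which is why a single helper serves both purposes.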
Next topic: Websearch
Create a search engine for searching the web
DBMS queries use tables and (optionally) indices
First thing to understand about websearch: we never run queries on the web itself
  Way too expensive, for several reasons
Instead:
  Build an index of the web
  Search the index
  Return the results
Crawling
To obtain the data for the index, we crawl the web
  Automated web-surfing
  Conceptually very simple, but difficult to do robustly
First, must get pages
Prof. Davis (NYU/CS)'s example:
  http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java
  http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java
Rule of thumb: 1 page per minute
Run program:
  sales% cd ~mjohnson/public_html/dbms/eg
  sales% java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms 200
Crawling issues in practice
DNS bottleneck
  To fetch a page from a text link, must first resolve its address
  BP claim: 87% of crawling time ~ DNS look-up
Search strategy?
Refresh strategy?
Primary key for webpages
  Use artificial DocIDs, not URLs
  More popular pages get shorter DocIDs (why?)
Crawling issues in practice
Content-seen test
  Compute fingerprint/hash (again!) of page content
robots.txt
  http://www.robotstxt.org/wc/robots.html
Bad HTML
  Tolerant parsing
Non-responsive servers
Spurious text
Inverted indices
Basic idea of finding pages:
Create an inverted index mapping words to pages
First, think of each webpage as a tuple
  One column for each possible word
  True means the word appears on the page
  Index on all columns
Now can search: john bolton
  select * from T where john=True and bolton=True
Inverted indices
Can simplify somewhat:
1. For each field index, delete the False entries
2. The True entries for each index become a bucket
Create an inverted index:
  One entry for each search word: the lexicon
  Each search word's entry points to its corresponding bucket
  Each bucket points to the pages with its word: the postings file
Final intuition: the inverted index doesn't map URLs to words; it maps words to URLs
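A minimal sketch of the lexicon-and-buckets idea, with made-up document IDs and text (build_index and and_search are illustrative names, not from the lecture):

```python
from collections import defaultdict

def build_index(docs):
    # Lexicon: one entry per word; the entry's bucket is the set of docIDs containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def and_search(index, query):
    # AND-query: intersect the buckets of all query words
    buckets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*buckets) if buckets else set()

docs = {1: "john bolton hearings", 2: "john smith", 3: "bolton weather"}
index = build_index(docs)
assert and_search(index, "john bolton") == {1}
```

Note the direction: the index is keyed by word, not by page, which is exactly the "final intuition" above.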
Inverted Indices
What's stored? For each word W, for each doc D:
  Relevance of D to W
  Number/percentage of occurrences of W in D
  Meta-data/context: bold, font size, title, etc.
In addition to page importance, keep in mind: this info is used to determine the relevance of particular words appearing on the page
Search engine infrastructure
Image from here: http://www.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27c_ir3-websearch-95.pdf
Google-like infrastructure
Very large distributed system
  File sizes routinely in GBs: Google File System
  Block size = 64MB (not KB)!
  100k+ low-quality Linux boxes: system failures are the rule, not the exception
Divide the index up by words into many barrels
  The lexicon maps word ids to each word's barrel
  Also, use a RAID-like strategy: a two-D matrix of servers
  Many commodity machines, frequent crashes (draw picture)
  May have more duplication for popular pages…
Google-like infrastructure
To respond to a single-word query Q(w):
  Send to the barrel column for word w
  Pick a random server in that column
  Return (some) sorted results
To respond to a multi-word query Q(w1…wn):
  For each word wi, send to the barrel column for wi, picking a random server in that column, for all words in parallel
  Merge and prune: step through until we find docs containing all the words, adding them to the results
  Index ordered on (word, docID), so linear time
  Return (some) sorted results
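The linear-time merge over docID-ordered postings can be sketched as follows (intersect is an illustrative name; the docID values are made up):

```python
def intersect(postings1, postings2):
    # Both lists are sorted by docID, so one pass through each suffices (linear time)
    i = j = 0
    common = []
    while i < len(postings1) and j < len(postings2):
        if postings1[i] == postings2[j]:
            common.append(postings1[i])
            i += 1
            j += 1
        elif postings1[i] < postings2[j]:
            i += 1
        else:
            j += 1
    return common

assert intersect([2, 5, 9, 14], [1, 5, 8, 14, 20]) == [5, 14]
```

A multi-word query applies this pairwise across the words' postings lists; the docID ordering is what makes stepping through in a single pass possible.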
Websearch v. DBMS

DBMS                      IR/websearch
Precise query language    KISS
SQL selects               Keyword search
Relational schemas        Loosely structured
Generate full answer      Display first/next k results
Read/write                Read-mostly
Commits immediately       Commits eventually
New topic: Sorting Results
How to respond to Q(w1,w2,…,wn)?
  Search the index for pages with w1,w2,…,wn
  Return in sorted order (how?)
Soln 1: current order
  Return 100,000 (mostly) useless results
  Sturgeon's Law: "Ninety percent of everything is crud."
Soln 2: ways from Information Retrieval theory
  library science + CS = IR
Simple IR-style approach
For each word W in a doc D, compute:
  (# occurrences of W in D) / (total # word occurrences in D)
Each document becomes a point in a space
  One dimension for every possible word (like k-NN and k-means)
  The value in that dimension is the ratio from above (maybe weighted, etc.)
Choose pages with high values for the query words
A little more precisely: each doc becomes a vector in the space
  Values same as above
  But: think of the query itself as a document vector
  Similarity between query and doc = dot product / cos
  Draw picture
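The vector-space scoring above can be sketched as follows; tf_vector and cosine are illustrative names, and the toy query and documents are made up:

```python
import math

def tf_vector(text):
    # One dimension per word; value = occurrences of the word / total words in the doc
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def cosine(u, v):
    # Similarity = dot product of the two vectors, normalized (cosine of the angle)
    dot = sum(weight * v[w] for w, weight in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = tf_vector("bill clinton")
doc_a = tf_vector("bill clinton speech on the economy")
doc_b = tf_vector("gardening tips for spring")
assert cosine(query, doc_a) > cosine(query, doc_b)   # doc_a shares the query's words
```

Treating the query as just another document vector is what lets a single similarity function rank every page against it.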
Information Retrieval Theory
With some extensions, this works well for relatively small sets of quality documents
But the web has 8 billion documents
Problem: if based just on percentages, very short pages containing the query words score very high
  BP: query a "major search engine" for "bill clinton"
  Result: a "Bill Clinton Sucks" page
Soln 3: sort by "quality"
What do you mean by quality?
Hire readers to rate my webpage (early Yahoo)
Problem: doesn't scale well
  More webpages than Yahoo employees…
Soln 4: count # citations (links) Idea: you don’t have to hire webpage raters
The rest of the web has already voted on the quality of my webpage
1 link to my page = 1 vote
Similar to counting academic citations Peer review
Soln 5: Google's PageRank
Count citations, but not equally: a weighted sum
Motiv: we said we believe that some pages are better than others
  Those pages' votes should count for more
A page can get a high PageRank many ways
Two cases at the ends of a continuum:
  Many pages link to you
  yahoo.com links to you
PageRank, not PigeonRank
  Search for "PigeonRank"…
PageRank
More precisely: let P be a page, and for each page Li that links to P, let C(Li) be the number of pages Li links to.
Then PR0(P) = SUM_i( PR0(Li) / C(Li) )
Motiv: each page votes with its quality; its quality is divided among the pages it votes for
Extensions: bold/large type/etc. links may get larger proportions…
Understanding PageRank (skip?)
Analogy 1: Friendster/Orkut
  Someone "good" invites you in
  Someone else "good" invited that person in, etc.
Analogy 2: PKE certificates
  My cert authenticated by your cert
  Your cert endorsed by someone else's…
Both cases here: eventually reach a foundation
Analogy 3: job/school recommendations
  Three people recommend you; why should anyone believe them?
  Three other people rec-ed them, etc.
  Eventually, we take a leap of faith
Understanding PageRank
Analogy 4: Random Surfer Model
Idealized web surfer:
  First, start at some page
  Then, at each page, pick a random link…
Turns out: after a long time surfing, Pr(we're at some page P right now) = PR0(P)
  PRs are normalized
Computing PageRank
For each page P, we want:
  PR(P) = SUM_i( PR(Li) / C(Li) )
But it's circular: how to compute?
Meth 1: for n pages, we've got n linear eqs and n unknowns
  Can solve for all PR(P)s, but too hard; see your linear algebra course…
Meth 2: iteratively
  Start with PR0(P) set to E for each P
  Iterate until no more significant change
  BP report O(50) iterations for O(30M) pages / O(300M) links
  # iters required grows only with the log of the web size
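Meth 2 is a straightforward power iteration; a minimal sketch (pagerank_iter is an illustrative name, and the two-page graph at the end is a toy example, not from the lecture):

```python
def pagerank_iter(links, iters=50):
    # links: page -> list of pages it links to
    # Iterate PR(P) = SUM over in-links Li of PR(Li)/C(Li)
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}    # start every page at E = 1/n
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for page, outs in links.items():
            for target in outs:                  # each page divides its rank among its out-links
                new[target] += pr[page] / len(outs)
        pr = new
    return pr

# Two pages linking only to each other split the rank evenly
pr = pagerank_iter({"A": ["B"], "B": ["A"]})
assert abs(pr["A"] - 0.5) < 1e-9 and abs(pr["B"] - 0.5) < 1e-9
```

This is the pure PR0 recurrence; as the next slides show, it misbehaves on dead ends and rank sinks, which motivates the modifications below.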
Problems with PageRank
Example (from Ullman):
  A points to Y, M; Y points to itself and A; M points nowhere (draw picture)
Start A,Y,M at 1:
  (1,1,1) → … → (0,0,0)
The rank dissipates
Soln: add an (implicit) self link to any dead-end

sales% cd ~mjohnson/public_html/dbms/eg
stern% java PageRank
Problems with PageRank
Example (from Ullman):
  A points to Y, M; Y points to itself and A; M points to itself
Start A,Y,M at 1:
  (1,1,1) → … → (0,0,3)
Now M becomes a rank sink
  RSM interp: we eventually end up at M and then get stuck
Soln: add "inherent quality" E to each page

stern% java PageRank2
Modified PageRank
Apart from inherited quality, each page also has inherent quality E:
  PR(P) = E + SUM_i( PR(Li) / C(Li) )
More precisely, take a weighted sum of the two terms:
  PR(P) = 0.15*E + 0.85*SUM_i( PR(Li) / C(Li) )
Leads to a modified random surfer model

stern% java PageRank3
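The 0.15/0.85 weighting is a small change to the iteration; a sketch using Ullman's A/Y/M example with M's self-link added (pagerank_damped is an illustrative name, not the lecture's PageRank3 code):

```python
def pagerank_damped(links, e, d=0.85, iters=50):
    # PR(P) = (1-d)*E(P) + d*SUM(PR(Li)/C(Li)): inherent plus inherited quality
    pr = dict(e)
    for _ in range(iters):
        new = {p: (1 - d) * e[p] for p in links}
        for page, outs in links.items():
            for target in outs:
                new[target] += d * pr[page] / len(outs)
        pr = new
    return pr

links = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": ["M"]}
e = {p: 1 / 3 for p in links}            # uniform inherent quality
pr = pagerank_damped(links, e)
assert abs(sum(pr.values()) - 1.0) < 1e-6   # total rank is conserved
assert pr["M"] > pr["A"]                    # the sink M still accumulates the most rank
```

With the E term, the sink no longer swallows everything: every page retains at least (1-d)*E(P), matching the surfer who occasionally jumps to a random new page.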
Random Surfer Model'
Motiv: if we (qua random surfer) end up at page M, we don't really stay there forever
  We type in a new URL
Idealized web surfer:
  First, start at some page
  Then, at each page, pick a random link
  But occasionally, we get bored and jump to a random new page
Turns out: after a long time surfing, Pr(we're at some page P right now) = PR(P)
Understanding PageRank
One more interp: hydraulic model
  Picture the web graph again
  Imagine each link as a tube between two nodes
  Imagine quality as fluid: each node is a reservoir initialized with amount E of fluid
Now let it flow…
Steady state: PR(P) of fluid eventually settles in node P
  Equilibrium
Somewhat analogous systems (skip?)
Sornette: "Why Stock Markets Crash"
  S_i(t+1) = sign(e_i + SUM_j S_j(t))
A trader buys/sells based on 1. his inclination and 2. what his associates are saying
The direction of a magnet is determined by 1. its old direction and 2. the directions of its neighbors
The activation of a neuron is determined by 1. its properties and 2. the activation of the neighbors connected to it by synapses
The PR of P is based on 1. its inherent value and 2. the PR of its in-links
Non-uniform Es (skip?)
So far, assumed E was constant for all pages
But can make E a function E(P): vary it by page
How do we choose E(P)?
  Idea 1: set it high for pages with high PR from earlier iterations
  Idea 2: set it high for pages I like
BP paper gave high E to John McCarthy's homepage
  Pages he links to get high PR, etc.
  Result: his own personalized search engine
Q: How would google.com get your prefs?
Tricking search engines
"Search Engine Optimization"
Challenge: include on your page lots of words you think people will query on
  Maybe hidden with the same color as the background
Response: popularity ranking
  The pages doing this probably aren't linked to that much
  But…
Tricking search engines
I can try to make my page look popular to the search engine
Challenge: create a page with 1000 links to my page
  Does this work?
Challenge: create 1000 other pages linking to it
Response: limit the weight a single domain can give to itself
Challenge: buy a second domain and put the 1000 pages there
Response: limit the weight from any single domain…
Using anchor text
Another good idea: use anchor text
Motiv: pages may not give the best descriptions of themselves
  Most search engines don't contain the words "search engine"
  BP claim: only 1 of 4 "top search engines" could find themselves on the query "search engine"
Anchor text also describes the page:
  Many pages link to google.com
  Many of them likely say "search engine" in/near the link
  Treat anchor text words as part of the page
Search for "US West" or for "g++"
Tricking search engines
This provides a new way to trick the search engine
Use of anchor text is a big part of result quality, but it has potential for abuse
  Lets you influence the appearance of other people's pages
Google Bombs
  Put up lots of pages linking to my page, using some particular phrase in the anchor text
  Result: a search for the words you chose produces my page
  Examples: "talentless hack", "miserable failure", "waffles", the last name of a prominent US senator…
Bidding for ads
Google had two really great ideas:
1. PageRank
2. AdWords/AdSense
Fundamental difficulty with mass-advertising: most of the audience doesn't want it
  Most people don't want what you're selling
  Think of car commercials on TV
But some of them do!
Bidding for ads
If you're selling widgets, how do you know who wants them?
Hard question, so answer its inversion:
If someone is searching for widgets, what should you try to sell them?
  Easy: widgets!
Whatever the user searches for, display ads relevant to that query
Bidding for ads
Q: How to divvy the correspondences up?
A: Create a market, and let the divvying take care of itself
Each company places the bid it's willing to pay for an ad responding to a particular query
Ad auction "takes place" at query-time
  Relevant ads displayed in descending bid order
  Company pays only if the user clicks
AdSense: place ads on external webpages, with the auction based on page content instead of the query
Huge huge huge business
Click Fraud
The latest challenge: users who click on ad links to cost their competitors money
  Or pay Indian housewives $.25/click
http://online.wsj.com/public/article/0,,SB111275037030799121-k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article
http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms
For more info
See sources drawn upon here:
Prof. Davis (NYU/CS) search engines course
  http://www.cs.nyu.edu/courses/fall02/G22.3033-008/
Original research papers by Page & Brin:
  The PageRank Citation Ranking: Bringing Order to the Web
  The Anatomy of a Large-Scale Hypertextual Web Search Engine
  Links on class page; interesting and very accessible
Google Labs: http://labs.google.com
You mean that's it?
Final Exam: next Thursday, 5/5, 10-11:50am
Final exam info is up
Course grades are curved
Interest in a review session?
Please fill out course evals! https://ais.stern.nyu.edu/
Comments by email, etc., are welcome
Thanks!