Search Technologies
Examples
• FAST
• Google Enterprise
– Google Search Solutions for business
– PageRank
• Lucene
– Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java
• Solr
– Solr is the popular, blazing-fast open source enterprise search platform from the Apache Lucene project.
Search Engine Ranking Criteria
Yahoo!
• has been in the search game for many years
• is better than MSN, but nowhere near as good as Google, at determining if a link is a natural citation or not
• has a ton of internal content and a paid inclusion program, both of which give them an incentive to bias search results toward commercial results
• things like cheesy off-topic reciprocal links still work great in Yahoo!
MSN (Bing)
• new to the search game
• is bad at determining whether a link is natural or artificial in nature
• due to weak link analysis, they place too much weight on the page content
• their poor relevancy algorithms cause a heavy bias toward commercial results
• likes bursty recent links
• new sites that are generally untrusted in other systems can rank quickly in MSN Search
• things like cheesy off-topic reciprocal links still work great in MSN Search
Google
• has been in the search game a long time, and saw the web graph when it was much cleaner than the current web graph
• is much better than the other engines at determining if a link is a true editorial citation or an artificial link
• looks for natural link growth over time
• heavily biases search results toward informational resources
• trusts old sites way too much
• a page on a site (or subdomain of a site) with significant age or link-related trust can rank much better than it should, even with no external citations
• they have aggressive duplicate-content filters that filter out many pages with similar content
• if a page is obviously focused on a term, they may filter the document out for that term; on-page variation and link anchor text variation are important. A page with a single reference or a few references to a modifier will frequently outrank pages that are heavily focused on a search phrase containing that modifier
• crawl depth is determined not only by link quantity, but also by link quality. Excessive low-quality links may make your site less likely to be crawled deeply, or even included in the index
• things like cheesy off-topic reciprocal links are generally ineffective in Google when you consider the associated opportunity cost
Ask
• looks at topical communities
• due to their heavy emphasis on topical communities, they are slow to rank sites until those sites are heavily cited from within their topical community
• due to their limited market share, they are probably not worth paying much attention to unless you are in a vertical where they have a strong brand that drives significant search traffic
History
• SMART
– Salton’s Magic Automatic Retriever of Text
– Vector Space Model
– Relevance feedback algorithm (customization)
– Latent Semantic Indexing (LSI)
Basic Vector Space Algo
• Vanilla search algo
• Keyword search (ignore search modifiers, e.g. not, and, this, their, is, or, of, and other stop words)
• Remove punctuation marks
• Reduce words to their root form (stemming)
– Combination of suffix and prefix stripping
– E.g.: students → student
• Lemmatization (stochastic algorithms) handles irregular forms that suffix rules cannot
– E.g.: swam → swim
– Harder cases: science, scientist??
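The pipeline above (tokenize, drop stop words, stem) can be sketched in a few lines of Python. The stop-word list and the suffix rules here are illustrative assumptions, not the exact lists from the slides; note that a crude suffix-stripper produces "technolog" rather than "technology", and cannot handle irregular forms like swam → swim, which is exactly why lemmatization is mentioned above.

```python
import re

# Assumed minimal stop-word list; real systems use much larger ones.
STOP_WORDS = {"not", "and", "this", "their", "is", "or", "of",
              "the", "a", "for", "have", "been", "over", "to", "in"}

def naive_stem(word):
    """Very rough suffix-stripping stemmer (illustrative only).
    Irregular forms such as 'swam' -> 'swim' need lemmatization."""
    for suffix in ("ists", "ist", "ies", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # 1. lowercase and strip punctuation, 2. drop stop words, 3. stem
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Search technologies have been around for over forty years."))
# → ['search', 'technolog', 'around', 'forty', 'year']
```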
Documents to be indexed
• Document 1– Search technologies have been around for over
forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.
• Document 2– Math and Physics students are familiar with the
challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.
• Document 3– Many serial killers do not suffer from psychosis
and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.
Stop words for removal
• Search technologies have been around for over forty years.
Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.
• Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.
• Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.
Stemming Changes Identified
• search technology around forty years time user base expanded first science technology information professionals finally information professionals pretty much everyone
• math physics students familiar challenge finding unambiguous right answer information retrieval finding right document much art science
• many serial killers suffer psychosis appear normal search killers take years latest police technology results shocking
Unique words identified
• Search[1] technology[2] around[3] forty[4] year[5] time[6]
user[7] base[8] expand[9] first[10] science[11] technology[2] information[12] professional[13] final[14] information[12] professional[13] pretty[15] much[16] everyone[17]
• math[18] physics[19] student[20] familiar[21] challenge[22] find[23] unambiguous[24] right[25] answer[26] information[12] retrieval[27] find[23] right[25] document[28] much[16] art[29] science[11]
• many[30] serial[31] killer[32] psychosis[33] appear[34] normal[35] search[1] killer[32] take[36] year[5] latest[37] police[38] technology[2] result[39] shock[40]
Search Dictionary
[1] search [2] technology [3] around [4] forty [5] year [6] time………[40] shock
Representing documents as 40-dimensional vectors
• Values are in the form <dictionary ref>:<number of occurrences>
• Doc1(1:1, 2:2, 3:1, 4:1, 5:1, 6:1, 7:1,….,13:2,14:1, 15:1,…, 17:1, 18:0, 19:0,…,40:0)
• Doc2(1:0, 2:0, 3:0,…,11:1,12:1,…,16:1,17:0,18:1, 19:1, 20:1,..,29:1,30:0,31:0,….,40:0)
• Doc3(1:1,2:1,3:0,4:0,5:1,6:0,7:0,8:0,…,29:0, 30:1,31:2,32:2,33:1…,40:1)
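Building the dictionary and the sparse <dictionary ref>:<count> vectors above can be sketched as follows. This assumes documents have already been preprocessed into stemmed tokens; the function names are hypothetical, and zero entries are simply omitted rather than stored.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign each unique stemmed term a 1-based dictionary index,
    in first-seen order (as in the slides)."""
    index = {}
    for doc in docs:
        for term in doc:
            if term not in index:
                index[term] = len(index) + 1
    return index

def to_vector(doc, index):
    """Represent a document as {dictionary ref: occurrence count};
    absent entries are implicitly zero (a sparse vector)."""
    return {index[t]: c for t, c in Counter(doc).items()}

# Toy corpus of already-stemmed documents
docs = [
    ["search", "technology", "around", "forty", "year"],
    ["math", "physics", "student"],
]
index = build_dictionary(docs)
print(to_vector(docs[0], index))   # {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
```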
Handling the Query
• “the promise of search technologies”
• After stemming: the promise of search technology
• search and technology are present in the dictionary, but “promise” is not, so it is ignored
• Hence the query becomes "search technology", which is equivalent to (1:1, 2:1), creating a new vector
• Converting it to a 40-dimensional vector: (1:1, 2:1, 3:0, 4:0, …, 40:0)
• Finally, find the shortest distance (best match) between the query vector and the previously stored document vectors
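The slides do not fix a specific distance measure for the "shortest distance (best match)" step; one common choice is cosine similarity, which compares vector directions and so is insensitive to document length. A minimal sketch over the sparse vectors from above (the toy vectors here are hypothetical):

```python
import math

def _norm(v):
    return math.sqrt(sum(c * c for c in v.values()))

def cosine_similarity(a, b):
    """Similarity of two sparse {dictionary ref: count} vectors.
    1.0 = same direction; 0.0 = no shared terms."""
    if not a or not b:
        return 0.0
    dot = sum(count * b[i] for i, count in a.items() if i in b)
    return dot / (_norm(a) * _norm(b))

# Hypothetical sparse vectors using the slides' dictionary refs
# (1 = search, 2 = technology); zero entries are omitted.
query = {1: 1, 2: 1}               # "search technology"
doc1 = {1: 1, 2: 2, 3: 1}          # mentions "technology" twice
doc3 = {1: 1, 5: 1, 31: 2}         # shares only "search"
best = max([("doc1", doc1), ("doc3", doc3)],
           key=lambda pair: cosine_similarity(query, pair[1]))
print(best[0])  # doc1 matches the query more closely
```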
Enhancements
• Weighting multiple occurrences
– (1:1000, 2:1000)
• Weighting for phrases
– Search technology
– Police technology
– Information professional
– Information retrieval
• Word clustering
– Search/retrieval/find
– Technology/science/math/physics
– First/final/latest
• Custom biases
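The "weighting multiple occurrences" enhancement is often implemented with sublinear (logarithmic) term-frequency weights rather than large raw multipliers, so that a tenth occurrence of a term counts for less than the first. This 1 + log2(tf) scheme is one standard choice, assumed here rather than taken from the slides:

```python
import math

def log_weight(count):
    """Damped term-frequency weight: repeated occurrences of a term
    add progressively less evidence than the first one."""
    return 1.0 + math.log2(count) if count > 0 else 0.0

def weight_vector(tf_vector):
    """Apply log weighting to a sparse {dictionary ref: count} vector."""
    return {i: log_weight(c) for i, c in tf_vector.items()}

print(weight_vector({1: 1, 2: 4}))  # {1: 1.0, 2: 3.0}
```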
Google PageRank
• PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms.
• Pages that we believe are important receive a higher PageRank and are more likely to appear at the top of the search results.
• PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.
PageRank
• A PageRank results from a mathematical algorithm based on the graph created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs like Wikipedia.
• The rank value indicates the importance of a particular page.
• A hyperlink to a page counts as a vote of support.
Google Page ranking
• PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
– A: the page in question
– T1…Tn: documents that reference A
– PR: page rank
– C(Ti): total number of links to outside resources on page Ti
– d: heuristic damping factor, usually set to 0.85
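The formula above can be evaluated iteratively, re-applying it to every page until the ranks stabilize. A minimal sketch over a hypothetical three-page web, using the original non-normalized form of the formula and assuming every page has at least one outbound link:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) over the pages
    Ti that link to A.  `links` maps each page to its outbound links."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                       # initial guess
    c = {p: len(targets) for p, targets in links.items()}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(pr[t] / c[t]
                                    for t, targets in links.items()
                                    if page in targets)
            for page in pages
        }
    return pr

# Hypothetical web: A and B both link to C; C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
# C, with two inbound votes, ends up ranked above A and B.
```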
• Content is not taken into account when PageRank is calculated.
• Not all links weigh the same when it comes to PR.
• If you had a web page with a PR8 and had 1 link on it, the site linked to would get a fair amount of PR value. But if you had 100 links on that page, each individual link would only get a fraction of the value.
• Bad incoming links don’t have an impact on PageRank.
• Ranking popularity considers site age, backlink relevancy and backlink duration. PageRank doesn’t.
• PageRank does not rank web sites as a whole, but is determined for each page individually.
• Each inbound link is important to the overall total, except links from banned sites, which don’t count.
• PageRank values don’t range from 0 to 10. PageRank is a floating-point number.
• Each PageRank level is progressively harder to reach. PageRank is believed to be calculated on a logarithmic scale.
• Google recalculates PRs continuously, but we see the update only once every few months (Google Toolbar).
• Frequent content updates don’t improve PageRank automatically. Content is not part of the PR calculation.
• High PageRank doesn’t mean high search ranking.
• DMOZ and Yahoo! listings don’t improve PageRank automatically.
• .edu and .gov sites don’t improve PageRank automatically.
• Sub-directories don’t necessarily have a lower PageRank than root directories.
• Wikipedia links don’t improve PageRank automatically (update: but pages which extract information from Wikipedia might improve PageRank).
• Links marked with the nofollow attribute don’t contribute to Google PageRank.
• Efficient internal onsite linking has an impact on PageRank.
• Links from and to high-quality related sites have an impact on PageRank.
• Multiple votes to one link from the same page count as much as a single vote.
Web Spiders
• Selection policy
• Re-visit policy