Search Engines
Internet Protocol (IP)
• Two major functions: host identification and location addressing (identifying where the destination is)
• Connectionless protocol
• Reliability is not guaranteed:
  – Data corruption: faulty way of breaking up messages
  – Lost data packets
  – Duplicate arrival
• IP addresses for hosts:
  – A numerical label assigned to each device participating in a computer network that uses the Internet Protocol for communication
  – Hostnames (e.g., cs.umb.edu)
  – We prefer meaningful names, but behind the scenes hostnames are converted to IP addresses, which are a series of four decimal numbers separated by dots, e.g., 205.39.155.18 (stored in 32 bits). What is a potential problem with 32-bit addresses?
  – An address can be split into a network address (type/size of network) and a host number (machine/device number on this network)
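The 32-bit layout and the network/host split can be checked with Python's standard ipaddress module. This is a minimal sketch: the address is the one from the slide, but the /24 prefix length is an assumption chosen only for illustration, since the slide does not say where the network part ends.

```python
import ipaddress

# The example address from the slide.
addr = ipaddress.IPv4Address("205.39.155.18")

# An IPv4 address is a single 32-bit number; the dotted form is just notation.
as_int = int(addr)
bits = format(as_int, "032b")

# Assumption: a /24 prefix (first 24 bits = network, last 8 bits = host).
network = ipaddress.IPv4Network("205.39.155.18/24", strict=False)
network_address = network.network_address
host_number = as_int & int(network.hostmask)

# Only 2**32 distinct addresses exist, which is the potential problem
# the slide asks about.
total_addresses = 2 ** 32

print(bits)
print(network_address, host_number, total_addresses)
```

With 32 bits there are only about 4.3 billion possible addresses, fewer than the number of devices that may want one.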
Domain name system
Example hostname: cs.bu.edu
• Domain name: the part of a hostname that specifies the type of organization or group
• Top-level domain (TLD): the last section of a domain name, specifying the type of organization or its country of origin
• The domain name system (DNS) translates hostnames into numeric IP addresses: when you make a request, a domain name server translates the hostname into an IP address, and that address is then used to find the host
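The translation step can be sketched with a toy lookup table. Everything here is hypothetical: split_hostname, resolve and TOY_DNS_TABLE are illustrative names, and the table entry uses a placeholder hostname and a documentation-range address rather than a real mapping. A real program would ask an actual name server, for example through socket.gethostbyname.

```python
def split_hostname(hostname):
    """Break a hostname into its dot-separated labels and pick out the TLD."""
    labels = hostname.lower().rstrip(".").split(".")
    return {"labels": labels, "tld": labels[-1]}


# Hypothetical table standing in for a domain name server's records.
# The hostname and address are placeholders, not a real mapping.
TOY_DNS_TABLE = {"www.example.com": "192.0.2.10"}


def resolve(hostname, table=TOY_DNS_TABLE):
    """Return the IP address for a hostname, or None if it is unknown."""
    return table.get(hostname.lower().rstrip("."))


parts = split_hostname("cs.bu.edu")
print(parts["labels"], parts["tld"])
print(resolve("www.example.com"))
```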
World Wide Web
• Good news:
  – Millions of webpages available on a variety of topics
• Bad news:
  – Millions of webpages available on a variety of topics
  – Haphazard labeling
  – Sitting on servers in various locations
How do we search for a specific topic?
Search engines!
• A web search engine is designed to search for information on the World Wide Web and FTP servers
  – Search based on keywords (Crawl)
  – Keep an index of useful pages (Index)
  – Present users with information based on the index (Search)
• Search engines operate algorithmically or with a mix of algorithmic and human input
Search engines: History
• Early summer of 1993: no commercial or large-scale search engines existed
• W3Catalog:
  – World's first primitive search engine
  – By Oscar Nierstrasz at the University of Geneva
• Wandex:
  – Web robot
  – Matthew Gray at MIT
  – Built to measure the size of the World Wide Web
Search engines: History
• Aliweb:
  – Indexed by hand
• JumpStation:
  – Web robot
  – Crawl, index and search
• 2000: Google rose to fame!
  – Algorithm called PageRank that ranks webpages based on the number and PageRank of the pages that link to them
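The idea that a page's rank depends on the number and rank of the pages linking to it can be sketched as a simple iteration. This is a textbook-style simplification, not Google's production algorithm: the damping value of 0.85, the iteration count and the four-page toy_web graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate ranks for a dict of page -> list of linked pages.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked from A, B and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C ends up with the highest rank because three pages link to it, and D with the lowest because no page links to it.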
Difference
• Early search engines
  – A few hundred thousand pages
  – One or two thousand inquiries
• Top search engines
  – A few hundred million pages
  – Billions of queries per day
Web Crawling
• A crawler is also known as a spider
• Special software agent that finds web pages and follows the links on them
• Contents are analyzed
  – Words, titles, special fields called meta tags
• Starting point?
  – Popular pages
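The crawl loop described above (start from known pages, read each page, follow its links) can be sketched with the standard library's HTML parser. To keep the example self-contained, it crawls a hypothetical in-memory FAKE_WEB dictionary instead of the real network; PageParser, crawl and the example.com pages are illustrative names, not part of any real crawler.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects the title, meta tags and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical three-page web used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": (
        '<html><head><title>Home</title>'
        '<meta name="keywords" content="search, engines"></head>'
        '<body><a href="http://example.com/a">A</a> '
        '<a href="http://example.com/b">B</a></body></html>'
    ),
    "http://example.com/a": (
        '<html><head><title>Page A</title></head>'
        '<body><a href="http://example.com/">Home</a> '
        '<a href="http://example.com/missing">Dead link</a></body></html>'
    ),
    "http://example.com/b": (
        "<html><head><title>Page B</title></head><body>No links.</body></html>"
    ),
}


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    queue = deque(seeds)
    seen = set(seeds)
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched; skip it
            continue
        parser = PageParser()
        parser.feed(html)
        results[url] = {"title": parser.title, "meta": parser.meta}
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


pages = crawl(["http://example.com/"], FAKE_WEB.get)
for url, info in pages.items():
    print(url, info["title"], info["meta"])
```

The seen set is what keeps the spider from revisiting a page when two pages link to each other.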
Google: Web Crawling
• At its peak:
  – Used multiple spiders
  – Each spider could keep ~300 connections to pages open at a time
  – Generated about 600 KB of data per second
• Starting points:
  – Dedicated server that feeds URLs to the spiders
  – Instead of relying on an ISP for domain names, they have their own DNS server
• The Google spider looks at two things:
  – Significant words within the page
  – Location of the words
Why is location important?
Meta tags
• Owner-specified
  – Can be helpful
  – Problem?
  – Robot exclusion protocol
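The robot exclusion protocol lets a site owner tell spiders which paths to stay out of, usually through a robots.txt file. A minimal sketch with Python's standard urllib.robotparser follows; the robots.txt content, the spider name "ExampleSpider" and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every spider is asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved spider checks each URL before fetching it.
allowed = parser.can_fetch("ExampleSpider", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleSpider", "http://example.com/private/data.html")
print(allowed, blocked)
```

The protocol is advisory: it only works for spiders that choose to check it.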
Indexing
• Spiders get the data
  – Now what?
  – Content analysis
  – Method by which information is sorted and stored
• One way: storing the word and the associated URL
  – No way to tell if the word is important or trivial
  – How many times was the word used?
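Storing each word with its URL, plus how many times the word was used, is commonly called an inverted index. The sketch below is one simple way to build it; tokenize, build_index and the two toy pages are illustrative names and data, not taken from any particular search engine.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to {url: number of times the word appears there}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index


# Hypothetical page contents, as a spider might have collected them.
toy_pages = {
    "u1": "Search engines search the web",
    "u2": "The web is large",
}
index = build_index(toy_pages)
print(index["search"])
print(index["web"])
```

Keeping the count alongside the URL gives the index a first, crude signal of how important a word is to a page.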
Ranking
• A relationship between items that defines their ordering
• For more useful information:
  – Number of times the word appears on the page
  – Assign a weight to each word
• Each search engine has a different formula for assigning weight to the words in its index
• Popular way of indexing: hashing
  – A formula attaches a numerical value to each word, and that value is used to store and retrieve the word's entry
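Weighting and hashing can be combined in a small sketch. Both formulas are invented for illustration, since every search engine uses its own: weight counts a word in the title three times as heavily as a word in the body, and word_hash is a simple polynomial hash that picks one of eight buckets.

```python
NUM_BUCKETS = 8  # hypothetical table size


def word_hash(word, num_buckets=NUM_BUCKETS):
    """Turn a word into a bucket number with a simple polynomial formula."""
    value = 0
    for ch in word:
        value = (value * 31 + ord(ch)) % num_buckets
    return value


def weight(count_in_title, count_in_body, title_weight=3.0, body_weight=1.0):
    """Made-up weighting: words in the title count more than words in the body."""
    return title_weight * count_in_title + body_weight * count_in_body


class HashedIndex:
    """Stores (word, url, weight) entries in buckets chosen by word_hash."""

    def __init__(self, num_buckets=NUM_BUCKETS):
        self.buckets = [[] for _ in range(num_buckets)]

    def add(self, word, url, entry_weight):
        bucket = self.buckets[word_hash(word, len(self.buckets))]
        bucket.append((word, url, entry_weight))

    def lookup(self, word):
        # Only one bucket has to be scanned, not the whole index.
        bucket = self.buckets[word_hash(word, len(self.buckets))]
        hits = [(url, w) for (stored, url, w) in bucket if stored == word]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)


idx = HashedIndex()
idx.add("crawl", "u1", weight(count_in_title=1, count_in_body=2))
idx.add("crawl", "u2", weight(count_in_title=0, count_in_body=4))
idx.add("index", "u2", weight(count_in_title=1, count_in_body=0))
print(idx.lookup("crawl"))
```

Page u1 ranks above u2 for "crawl" even though u2 uses the word more often, because the title occurrence carries more weight.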
Building a search
• Query: a string of words or a single word
• Complex queries require the use of Boolean operators
  – AND: all terms joined by the operator must appear; some engines also accept '+'
  – OR
  – NOT
  – FOLLOWED BY
  – NEAR
  – Quotation marks
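Several of these operators map directly onto set operations over an index that records where each word occurs. The sketch below covers AND, OR, NOT and NEAR only; the function names, the three toy pages and the default NEAR distance of 3 words are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positions(pages):
    """Map each word to {url: [positions of the word on that page]}."""
    index = {}
    for url, text in pages.items():
        for pos, word in enumerate(tokenize(text)):
            index.setdefault(word, {}).setdefault(url, []).append(pos)
    return index


def pages_with(index, word):
    return set(index.get(word.lower(), {}))


def q_and(index, a, b):
    return pages_with(index, a) & pages_with(index, b)


def q_or(index, a, b):
    return pages_with(index, a) | pages_with(index, b)


def q_not(index, a, b):
    """Pages that contain a but not b."""
    return pages_with(index, a) - pages_with(index, b)


def q_near(index, a, b, distance=3):
    """Pages where a and b occur within `distance` words of each other."""
    hits = set()
    for url in q_and(index, a, b):
        pos_a = index[a.lower()][url]
        pos_b = index[b.lower()][url]
        if any(abs(i - j) <= distance for i in pos_a for j in pos_b):
            hits.add(url)
    return hits


# Hypothetical page contents.
toy_pages = {
    "p1": "search engines crawl the web",
    "p2": "a spider will crawl and index pages",
    "p3": "index the web with a search engine",
}
positions = build_positions(toy_pages)
print(q_and(positions, "crawl", "web"))
print(q_not(positions, "web", "crawl"))
print(q_near(positions, "search", "web"))
```

FOLLOWED BY and quoted phrases could be built on the same position lists by also checking the order of the positions.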
Building a search
• Literal searches: based on Boolean operators
• Concept-based: statistical analysis of pages containing the words or phrases you search for
  – More information is stored about each page
  – Search times may be longer
• Natural language queries
  – Ask a question: AskJeeves.com
  – Parses keywords
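One very simple way to parse keywords out of a question is to drop common filler words and keep the rest. This is only a sketch of that idea, not how AskJeeves.com worked; the STOP_WORDS list and the extract_keywords name are assumptions for the example.

```python
import re

# Hypothetical list of filler words to ignore; real systems use longer lists.
STOP_WORDS = {
    "a", "an", "and", "are", "do", "does", "how", "i", "in", "is",
    "of", "the", "to", "what", "when", "where", "who", "why",
}


def extract_keywords(question):
    """Keep only the words of a question that are not filler words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [word for word in words if word not in STOP_WORDS]


print(extract_keywords("What is the capital of France?"))
print(extract_keywords("How do search engines rank pages?"))
```

The remaining keywords can then be run against the index like any other query.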
Money, money, money...
• Beyond selling shares or private investment
• Three main methods:
  – Online purchases
  – Web advertising
    • Keywords relating to a product, service or business
  – Allowing users to integrate ads into their own websites
  – Fourth, shady way: selling user information
Google Company Culture
• Sergey Brin and Larry Page began Google with a few networked computers at Stanford
• Multibillion-dollar organization
  – More than 19,000 employees globally
  – Market capitalization of more than $145 billion
• Googleplex:
  – Free food: gourmet café stations
  – Snack rooms
  – Exercise rooms
  – Game rooms
  – Grand piano