Search Engines

Internet protocol (IP)
• Two major functions: addresses that identify hosts and their locations, and identification of each packet's destination
• Connectionless protocol
• Reliability issues:
– Data corruption: faulty way of breaking up messages
– Lost data packets
– Duplicate arrival
• IP addresses for hosts:
– A numerical label assigned to each device participating in a computer network that uses the internet protocol for communication
– Hostnames (Ex: cs.umb.edu)
– We prefer meaningful names, but behind the scenes hostnames are converted to IP addresses, which are a series of four decimal numbers separated by dots. Ex: 205.39.155.18 (stored in 32 bits). What is a potential problem with 32-bit addresses?
– Can be split into a network address (type/size of network) and a host number (machine/device number on this network)
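
A minimal sketch of how a dotted address maps onto 32 bits and how it can be split into a network part and a host number. The 24-bit network prefix and the helper name `split_address` are assumptions made for illustration; the slides do not say where the split falls for this address.

```python
import ipaddress

def split_address(dotted: str, prefix_len: int):
    """Return (32-bit integer, network part, host part) for an IPv4 address."""
    as_int = int(ipaddress.IPv4Address(dotted))  # four decimal numbers -> one 32-bit value
    host_bits = 32 - prefix_len                  # bits left over for the host number
    network_part = as_int >> host_bits           # leading bits: which network
    host_part = as_int & ((1 << host_bits) - 1)  # trailing bits: which machine on it
    return as_int, network_part, host_part

# Assumed split: first 24 bits are the network, last 8 bits are the host.
as_int, network_part, host_part = split_address("205.39.155.18", 24)

print(f"{as_int:032b}")         # the address written out as 32 binary digits
print(network_part, host_part)
print(2 ** 32)                  # total number of distinct 32-bit addresses
```

Because only 32 bits are available, there can be at most 2**32 (about 4.3 billion) distinct addresses, which is one answer to the question about the limits of 32-bit addressing.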

Domain name system
• Example hostname: cs.bu.edu
• Domain names: The part of a hostname that specifies type of organization or group
• Top-level domain (TLD): The last section of a domain name specifying the type of organization or its country of origin.
• The domain name system translates hostnames into numeric IP addresses: when you make a request, a domain name server translates the hostname into an IP address, and that address is then used to locate the host
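
The hostname-to-address translation can be sketched with a toy lookup table. The table `TOY_DNS_TABLE`, the helper names, and the addresses in it are all made up for illustration (they come from a range reserved for documentation); a real lookup asks a domain name server over the network instead.

```python
# Hypothetical table standing in for a domain name server's records.
TOY_DNS_TABLE = {
    "cs.bu.edu": "192.0.2.10",   # made-up address, not the real one
    "cs.umb.edu": "192.0.2.20",  # made-up address, not the real one
}

def top_level_domain(hostname: str) -> str:
    """Return the last dot-separated section of a hostname."""
    return hostname.rsplit(".", 1)[-1]

def resolve(hostname: str):
    """Translate a hostname to an IP address, or None if it is unknown."""
    return TOY_DNS_TABLE.get(hostname.lower())  # hostnames are case-insensitive

print(top_level_domain("cs.bu.edu"))
print(resolve("cs.bu.edu"))
print(resolve("unknown.example"))
```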

World Wide Web
• Good News:
– Millions of webpages available on a variety of topics
• Bad News:
– Millions of webpages available on a variety of topics
– Haphazard labeling
– Sitting on servers in various locations
How do we search for a specific topic?

Search engines!
• A web search engine is designed to search for information on the World Wide Web and FTP servers
– Search based on key words (Crawl)
– Keep an index of useful pages (Index)
– Present users with information based on the index (Search)
• Search engines operate algorithmically or with a mix of algorithmic and human input

Search engines: History
• Early summer of 1993: No commercial or large-scale search engines existed
• W3Catalog:
– World’s first primitive search engine
– By Oscar Nierstrasz at the University of Geneva
• Wandex:
– Web robot
– Matthew Gray at MIT
– Built to measure the size of the World Wide Web

Search engines: History
• Aliweb:
– Indexed by hand
• JumpStation:
– Web robot
– Crawl, index and search
• 2000: Google rose to fame!
– Algorithm called PageRank that ranks webpages based on the number of links pointing to them and the PageRank of the pages those links come from
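
A simplified sketch of the PageRank idea: a page's score depends on how many pages link to it and on the scores of those linking pages. The toy link graph, the damping value of 0.85, and the fixed iteration count are assumptions for illustration, not Google's production settings.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages; `links` maps each page to the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its current score evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: a, b and d all link to c; c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, c scores highest because three pages link to it, and a outranks b and d because its single inbound link comes from the high-scoring c.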

Difference
• Early search engines
– Few hundred thousand pages
– One or two thousand inquiries per day
• Top search engines
– Hundreds of millions of pages
– Billions of queries per day

Web Crawling
• Crawler also known as a spider
• Special software agent that finds web pages and follows the links on them
• Contents are analyzed
– Words, titles, special fields called meta tags
• Starting point?
– Popular pages
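
A minimal sketch of a spider that starts from one page and follows links breadth-first. The in-memory `TOY_WEB` dictionary and its URLs are hypothetical stand-ins for real pages, so the example runs without network access; a real crawler would fetch pages over HTTP.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical pages: URL -> HTML content.
TOY_WEB = {
    "http://popular.example/": '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>',
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://popular.example/">home</a>',
    "http://b.example/": '<a href="http://missing.example/">broken link</a>',
}

def crawl(start_url, fetch):
    """Visit pages breadth-first from start_url; return URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be retrieved
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("http://popular.example/", TOY_WEB.get))
```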

Google: Web Crawling
• At its peak:
– Uses multiple spiders
– Each spider can keep ~300 connections to pages open at a time
– Generates about 600 KB of data per second
• Starting points:
– Dedicated server that feeds URLs to spiders
– Instead of relying on an ISP to resolve domain names, they have their own DNS server
• Google spider looks at two things:
– Significant words within the page
– Location of the words
Why is location important?

Meta tags
• Owner specific
– Can be helpful
– Problem?
– Robot exclusion protocol
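
A short sketch of how a well-behaved spider might honor the robot exclusion protocol, using Python's standard `urllib.robotparser`. The robots.txt content, the spider name `ToySpider`, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules directly, no network request

# Ask whether a spider named "ToySpider" may fetch each URL.
allowed_public = parser.can_fetch("ToySpider", "http://site.example/public/index.html")
allowed_private = parser.can_fetch("ToySpider", "http://site.example/private/notes.html")
print(allowed_public, allowed_private)
```

The protocol is a request, not an enforcement mechanism: it only works when the spider chooses to check the rules before fetching.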

Indexing
• Spiders get the data
– Now what?
– Content analysis
– Method by which information is sorted and stored
• One way: Storing the word and associated URL
– No way to tell if the word is important or trivial
– How many times was the word used?
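
A minimal sketch of an index that stores each word with its associated URLs and also records how many times the word was used on each page. The page texts and URLs are made up, and splitting on letters and digits is a simplifying assumption about what counts as a word.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to {url: number of times the word appears on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index

# Hypothetical crawled pages.
toy_pages = {
    "http://a.example/": "search engines crawl the web",
    "http://b.example/": "the web is big and the web is messy",
}
index = build_index(toy_pages)
print(index["web"])
print(index["crawl"])
```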

Ranking
• A relationship between items that defines their ordering
• For more useful information:
– Number of times the word appears on the page
– Assign a weight to each word
• Each search engine has a different formula for assigning weight to words in its index
• Popular way of indexing: Hashing
– A formula assigns a numerical value to each word, and that value is used to retrieve the word's entry
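
Two small sketches for the ideas above. The weighting formula (title occurrences count three times as much as body occurrences) and the hashing formula are invented for illustration; every search engine uses its own formulas, which are not given in the slides.

```python
def word_weight(word, title, body, title_boost=3):
    """Weight a word by how often it appears, counting title hits more heavily."""
    title_hits = title.lower().split().count(word)
    body_hits = body.lower().split().count(word)
    return title_boost * title_hits + body_hits

def bucket_for(word, bucket_count=16):
    """Turn a word into a bucket number with a simple, repeatable formula."""
    value = 0
    for ch in word:
        value = (value * 31 + ord(ch)) % bucket_count
    return value

weight = word_weight("search", "Search engines", "how search engines search the web")
print(weight)             # one title hit (x3) plus two body hits
print(bucket_for("web"))  # the same word always lands in the same bucket
```

Because the formula always gives the same number for the same word, a lookup can jump straight to that bucket instead of scanning the whole index.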

Building a search
• Query: string of words or a single word
• Complex queries require the use of Boolean operators
– AND: all terms joined by the operator must appear; some engines also accept ‘+’
– OR
– NOT
– FOLLOWED BY
– NEAR
– Quotation marks
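
The AND, OR and NOT operators can be sketched as set operations over an index that maps each word to the pages containing it. The index contents and page names here are hypothetical.

```python
def pages_with(index, term):
    """Return the set of pages that contain the term (empty if unknown)."""
    return set(index.get(term, set()))

# Hypothetical index: word -> pages containing it.
toy_index = {
    "search": {"p1", "p2", "p3"},
    "engine": {"p1", "p3"},
    "history": {"p3", "p4"},
}

# search AND engine: pages containing both terms
both = pages_with(toy_index, "search") & pages_with(toy_index, "engine")
# engine OR history: pages containing at least one of the terms
either = pages_with(toy_index, "engine") | pages_with(toy_index, "history")
# search NOT history: pages containing the first term but not the second
without = pages_with(toy_index, "search") - pages_with(toy_index, "history")

print(sorted(both), sorted(either), sorted(without))
```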

Building a search
• Literal searches: based on Boolean operators
• Concept-based: Statistical analysis on pages containing the words or phrases you search for
– Information stored about each page is greater
– Search times may be longer
• Natural language queries
– Ask a question: AskJeeves.com
– Parses keywords
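
A rough sketch of pulling keywords out of a question by dropping common filler words. The `STOP_WORDS` list is a small hypothetical sample, and real natural language engines do considerably more than this.

```python
import re

# Hypothetical list of filler words to ignore.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "do", "does", "of", "in", "on", "to",
    "what", "who", "where", "when", "how", "why",
}

def extract_keywords(question):
    """Lowercase the question, split it into words, and drop filler words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [word for word in words if word not in STOP_WORDS]

print(extract_keywords("What is the capital of France?"))
print(extract_keywords("How do search engines rank pages?"))
```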

Money money money...
• Beyond selling shares or private investment
• Three main methods:
– Online purchases
– Web advertising
  • Keywords relating to product, service or business
– Allowing users to integrate ads into their own websites
– Fourth shady way: Selling user information

Google Company Culture
• Sergey Brin and Larry Page began Google with a few networked computers at Stanford
• Multibillion-dollar organization
– >19000 employees globally
– Market capitalization >$145 billion
• Googleplex:
– Free food: gourmet café stations
– Snack rooms
– Exercise rooms
– Game rooms
– Grand piano
