How a Search Engine Works_report

  • 8/8/2019 How a Search Engine Works_report

    1/16

    DEPARTMENT OF COMPUTER SCIENCE AND

    ENGINEERING

    SYNERGY INSTITUTE OF ENGINEERING &

    TECHNOLOGY, DHENKANAL

    SEMINAR ON

    How a search engine works?

    Seminar Report 10 How a search engine works

    Dept. of CSE S.I.E.T, Dhenkanal

    Guided by: XxX

    Submitted by: SOVAN MISRA

    CS-O7-42

    07012301472

    DEPARTMENT OF COMPUTER SCIENCE &

    ENGINEERING

    SYNERGY INSTITUTE OF ENGINEERING

    AND TECHNOLOGY

    DHENKANAL

    CERTIFICATE

    Certified that this is a bonafide record of the seminar entitled "HOW A
    SEARCH ENGINE WORKS" done by the following student, SOVAN MISRA, of the
    7th semester, Computer Science and Engineering, in the year 2010, in
    partial fulfilment of the requirements for the award of the Degree of
    Bachelor of Technology in Computer Science and Engineering of Synergy
    Institute of Engineering and Technology, Dhenkanal.

    XxX XxX

    Seminar Guide Head of the Department

    ACKNOWLEDGEMENT

    I thank my seminar guide XxX, Lecturer, SIET, for her proper guidance and
    valuable suggestions. I am indebted to XxX, the HOD, Computer Science
    department, and the other faculty members for giving me an opportunity to
    learn and do this seminar. If not for the above-mentioned people, my
    seminar would never have been completed successfully. I once again extend
    my sincere thanks to all of them.

    SOVAN MISRA

    HOW A SEARCH ENGINE WORKS

    INTRODUCTION

    What is a search engine?

    A search engine is a software program that searches a database and gathers
    and reports information that contains, or is related to, specified terms.

    Or

    It is a website whose primary function is to provide a search facility for
    gathering and reporting information available on the Internet or a portion
    of the Internet.

    Why Search Engine?

    In today's world we have millions and billions of pieces of information
    available in the vast World Wide Web (WWW). Searching through it by hand
    for one piece of information would cost the user a great deal of time. For
    this purpose we need tools that make searching automatic, quick, and
    effortless.

    So, to reduce the problem to a more or less manageable one, Web search
    engines were introduced in the early 1990s.

    Different Search engines:

    History of search engines:

    In 1990, the first search engine, ARCHIE, was released. At that time there
    was no World Wide Web. Data resided on defence contractor, university, and
    government computers, and techies were the only people accessing the data.
    The computers were interconnected by Telnet*.

    File Transfer Protocol (FTP) was used for transferring files from computer
    to computer. There was no such thing as a browser, so information or data
    was transferred in its native format and viewed using the software
    associated with the file type.

    Archie searched FTP servers and indexed their files into a searchable
    directory.

    In 1991, Gopherspace came into existence with the advent of Gopher. It
    catalogued FTP sites, and the resulting catalogue became known as
    Gopherspace.

    In 1994, WebCrawler, a new type of search engine that indexed the entire
    contents of a webpage, was introduced.

    Between 1995 and 1998, many changes and developments occurred in the world
    of search engines. Meta tags* in the webpage were first utilized by some
    search engines to determine relevancy.

    Search engine rank-checking software was introduced. It provides an
    automated tool to determine a web site's position and ranking within the
    major search engines.

    Around 1998, search engine algorithms were introduced to optimize
    searching.

    In 2000, marketers determined that pay-per-click campaigns were an easy
    yet expensive approach for gaining top search rankings. To elevate sites in
    the search engine rankings, websites started adding useful and relevant
    content while optimizing their web pages for each specific search engine.
    Search engine optimization (SEO) still continues today as the algorithms
    keep improving.

    TYPES OF SEARCH ENGINES:

    On the basis of how they work, search engines are categorised into the
    following groups:

    * Crawler-based search engines.

    * Directories.

    * Hybrid search engines.

    * Meta search engines.

    CRAWLER-BASED SEARCH ENGINE:

    It uses automated software programs, known as spiders, crawlers, robots,
    or bots, to survey and categorise web pages.

    A spider finds a web page, downloads it, and analyses the information
    presented on the page. The web page is then added to the search engine's
    database.

    When a user performs a search, the search engine checks its database of
    web pages for the keywords the user searched.

    The results are ordered according to the search engine's algorithm and
    shown in the search engine result pages (SERPs).

    Ex:

    GOOGLE (www.google.com)

    ASK (www.ask.com)

    SPIDER ALGORITHM:

    All spiders use the following algorithm for retrieving documents from the
    web:

    1) The algorithm uses a list of known URLs. This list contains at least
       one URL to start with.

    2) A URL is taken from the list and the corresponding document is
       retrieved from the web.

    3) The document is parsed to retrieve information for the index database
       and to extract the embedded links to other documents.

    4) The URLs of the links found in the document are added to the list of
       known URLs.

    5) If the list is empty or some limit is exceeded (number of documents
       retrieved, size of the index database, etc.), the algorithm stops;
       otherwise it continues at step 2.

    The crawler program treats the World Wide Web as a big graph with pages as
    nodes and hyperlinks as arcs.

    The crawler works with a simple goal: indexing all keywords in the web
    page titles.
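
    The retrieval loop described above can be sketched roughly as follows.
    This is a minimal illustration, not the code of any real spider: the
    FAKE_WEB dictionary, the fetch helper, and the max_documents limit are
    all assumptions made so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (title, list of outgoing links).
FAKE_WEB = {
    "http://a.example/": ("Page A", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Page B", ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("Page C", ["http://d.example/"]),
    "http://d.example/": ("Page D", []),
}


def fetch(url):
    """Stand-in for an HTTP download; returns (title, links) or None."""
    return FAKE_WEB.get(url)


def crawl(seed_urls, max_documents=100):
    """Retrieve documents starting from seed_urls, following embedded links."""
    to_visit = deque(seed_urls)   # known URLs that have not been fetched yet
    known = set(seed_urls)        # every URL ever added to the list
    index = {}                    # url -> title, standing in for the index database

    # Stop when the list is empty or the document limit is reached.
    while to_visit and len(index) < max_documents:
        url = to_visit.popleft()  # take the next known URL and retrieve it
        page = fetch(url)
        if page is None:
            continue
        title, links = page       # "parsing": index data plus embedded links
        index[url] = title
        for link in links:        # add newly discovered URLs to the list
            if link not in known:
                known.add(link)
                to_visit.append(link)
    return index


pages = crawl(["http://a.example/"])
```

    Because pages are nodes and hyperlinks are arcs, this loop is a
    breadth-first traversal of the graph; the known set is what keeps the
    crawler from looping forever on pages that link to each other.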

    Three data structures are needed for the crawler (spider) algorithm:

    * A large linear array, url_table.

    * A heap.

    * A hash table.

    url_table:

    It is a large linear array that contains millions of entries.

    Each entry contains two pointers:

    * Pointer to the URL.

    * Pointer to the title.

    These are variable-length strings and are kept on the heap.

    Heap:

    It is a large unstructured chunk of virtual memory to which strings can be
    appended.

    Hash table:

    It is the third data structure, with n entries.

    Any URL can be run through a hash function to produce a non-negative
    integer less than n.

    All URLs that hash to the value k are hooked together on a linked list
    starting at entry k of the hash table.

    Every entry in url_table is also entered into the hash table.

    The main use of the hash table is to start with a URL and be able to
    quickly determine whether it is already present in url_table.
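
    As a rough illustration of how the three structures fit together, the
    sketch below models them in Python. The class name UrlStore, the use of
    NUL-terminated strings in a bytearray for the heap, and the CRC-32 based
    hash function are all assumptions chosen for the example; the report does
    not specify them.

```python
import zlib


class UrlStore:
    """Toy versions of url_table, the heap, and the hash table."""

    def __init__(self, n_buckets=1024):
        self.heap = bytearray()   # unstructured chunk; strings are only appended
        self.url_table = []       # entries: (pointer to URL, pointer to title)
        self.n = n_buckets
        # Bucket k holds the chain of url_table indices whose URL hashes to k.
        self.hash_table = [[] for _ in range(n_buckets)]

    def _append(self, text):
        """Append a NUL-terminated string to the heap; return its offset."""
        offset = len(self.heap)
        self.heap += text.encode("utf-8") + b"\0"
        return offset

    def _read(self, offset):
        """Read the NUL-terminated string stored at a heap offset."""
        end = self.heap.index(0, offset)
        return self.heap[offset:end].decode("utf-8")

    def _hash(self, url):
        """Map a URL to a non-negative integer less than n."""
        return zlib.crc32(url.encode("utf-8")) % self.n

    def find(self, url):
        """Return the url_table index of url, or None if it is absent."""
        for idx in self.hash_table[self._hash(url)]:
            if self._read(self.url_table[idx][0]) == url:
                return idx
        return None

    def add(self, url, title):
        """Insert (url, title) unless already present; return its index."""
        existing = self.find(url)
        if existing is not None:
            return existing
        self.url_table.append((self._append(url), self._append(title)))
        idx = len(self.url_table) - 1
        self.hash_table[self._hash(url)].append(idx)
        return idx

    def get(self, idx):
        """Return the (url, title) strings for a url_table entry."""
        url_ptr, title_ptr = self.url_table[idx]
        return self._read(url_ptr), self._read(title_ptr)
```

    The membership test only walks one short chain instead of scanning the
    whole array, which is the reason for keeping a hash table next to
    url_table.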

    DATA STRUCTURE FOR CRAWLER

    Building the index requires two phases: searching (URL processing) and
    indexing.

    The heart of the search engine is a recursive procedure, process_url,
    which takes a URL string as input.

    Searching is done by process_url as follows:

    1) It hashes the URL to see if it is already present in url_table. If so,
       it is done and returns immediately.

    2) If the URL is not already known, its page is fetched.

    3) The URL and title are then copied to the heap, and pointers to these
       two strings are entered in url_table.

    4) The URL is also entered into the hash table.

    5) Finally, process_url extracts all the hyperlinks from the page and
       calls process_url once per hyperlink, passing the hyperlink's URL as
       the input parameter.
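
    A minimal sketch of process_url is given below. It is only an
    illustration under stated assumptions: the PAGES dictionary and
    fetch_page stand in for real downloads, plain Python strings stand in for
    heap pointers, a dict stands in for the chained hash table, and the
    regular expressions are a simplification of real HTML parsing.

```python
import re

# Hypothetical in-memory pages standing in for real downloads.
PAGES = {
    "http://a.example/": '<title>Home of Search</title><a href="http://b.example/">b</a>',
    "http://b.example/": '<title>Crawler Notes</title><a href="http://a.example/">a</a>'
                         '<a href="http://c.example/">c</a>',
    "http://c.example/": "<title>Index Basics</title>",
}

url_table = []    # entries: (url, title); strings stand in for heap pointers
hash_table = {}   # url -> url_table index; a dict stands in for the hash table


def fetch_page(url):
    """Stand-in for fetching a page over the network."""
    return PAGES.get(url, "")


def process_url(url):
    # Already present in url_table? Then return immediately.
    if url in hash_table:
        return
    # Not known yet: fetch the page and pull out its title.
    html = fetch_page(url)
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    title = match.group(1).strip() if match else ""
    # Record the URL and title, and enter the URL in the hash table.
    url_table.append((url, title))
    hash_table[url] = len(url_table) - 1
    # Recurse once per hyperlink found on the page.
    for link in re.findall(r'href="([^"]+)"', html):
        process_url(link)


process_url("http://a.example/")
```

    The hash-table check at the top is what terminates the recursion when
    pages link back to each other. On a large web graph the recursion can get
    very deep, so an explicit queue of pending URLs is a common alternative.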

    Indexing is done as follows:

    For each entry in url_table, the indexing procedure examines the title
    and selects all words not on the stop list (a list of very common words
    that are not worth indexing).

    Each selected word is written to a file as a line consisting of the word
    followed by the current url_table entry number.

    When the whole table has been scanned, the file is sorted by word.
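
    The indexing phase can be sketched in a few lines. The stop list, the
    sample table, and the function name build_index_lines are hypothetical,
    and the lines are kept in a Python list rather than written to a file.

```python
STOP_LIST = {"a", "an", "and", "of", "the", "to", "in"}   # hypothetical stop list


def build_index_lines(url_table):
    """Return 'word entry_number' lines, sorted by word, for all title words
    that are not on the stop list."""
    pairs = []
    for entry_number, (_url, title) in enumerate(url_table):
        for word in title.lower().split():
            word = word.strip(".,:;!?\"'()")
            if word and word not in STOP_LIST:
                pairs.append((word, entry_number))
    pairs.sort()   # by word first, then by entry number
    return [f"{word} {number}" for word, number in pairs]


sample_table = [
    ("http://a.example/", "The Home of Search"),
    ("http://b.example/", "Search and Crawler Notes"),
]
index_lines = build_index_lines(sample_table)
```

    After sorting, all lines for the same word sit next to each other, so the
    entry numbers for a keyword can later be found with a binary search.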

    Formulating queries:

    Submitting a keyword causes a request to be sent to the machine where the
    index is located (the web server).

    Each keyword is then looked up in the index database to find the set of
    url_table indices for that keyword.

    These indices are used to look up url_table and find all the matching
    titles and URLs, which are then stored in the document server.

    These are then combined to form a web page and sent back to the user as
    the response.
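
    A rough sketch of the lookup side follows. The sample url_table and index
    contents are invented for illustration, and combining several keywords by
    intersecting their result sets is an assumption; the report only says
    that a set of indices is found for each keyword.

```python
from bisect import bisect_left

# Hypothetical data in the shapes described above.
url_table = [
    ("http://a.example/", "The Home of Search"),
    ("http://b.example/", "Search and Crawler Notes"),
]
# Sorted (word, url_table index) pairs, as produced by the indexing phase.
index = [("crawler", 1), ("home", 0), ("notes", 1), ("search", 0), ("search", 1)]


def lookup(keyword):
    """Return the url_table indices recorded for one keyword."""
    keyword = keyword.lower()
    pos = bisect_left(index, (keyword, -1))   # first pair with word >= keyword
    hits = []
    while pos < len(index) and index[pos][0] == keyword:
        hits.append(index[pos][1])
        pos += 1
    return hits


def answer_query(keywords):
    """Return (title, url) pairs for pages whose titles contain every keyword."""
    result = None
    for keyword in keywords:
        hits = set(lookup(keyword))
        result = hits if result is None else result & hits
    return [(url_table[i][1], url_table[i][0]) for i in sorted(result or [])]
```

    For example, answer_query(["search", "crawler"]) returns only the second
    page, because it is the only entry whose title contains both words. A
    real engine would then format these pairs into the result page.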

    Determining Relevance

    The classic "TF-IDF" algorithm is used for determining relevance.

    It is a weight often used in information retrieval and text mining.

    This weight is a statistical measure used to evaluate how important a
    word is to a document in a collection.

    A high TF-IDF weight is reached by a high term frequency (in the given
    document) and a low document frequency of the term in the whole
    collection of documents.

    Term Frequency

    Term frequency is the number of times a given term appears in a document,
    divided by the total number of term occurrences in that document. It
    gives a measure of the importance of the term t_i within the particular
    document.

    \[ tf_i = \frac{n_i}{\sum_k n_k} \]

    where n_i is the number of occurrences of the considered term, and the
    denominator is the number of occurrences of all terms.

    E.g.

    If a document contains 100 total words and the word "computer" appears 3
    times, then the term frequency of the word "computer" in the document is
    0.03 (3/100).
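
    The term frequency calculation can be checked with a short sketch. The
    function name term_frequency and the made-up 100-word document are
    illustrative assumptions.

```python
from collections import Counter


def term_frequency(term, document_words):
    """Occurrences of term divided by the occurrences of all terms."""
    counts = Counter(document_words)
    total = sum(counts.values())
    return counts[term] / total if total else 0.0


# A 100-word document in which "computer" appears 3 times.
doc = ["computer"] * 3 + ["filler"] * 97
tf = term_frequency("computer", doc)   # 3 / 100
```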

    Inverse Document Frequency

    The inverse document frequency is a measure of the general importance of
    the term (obtained by dividing the number of all documents by the number
    of documents containing the term, and then taking the logarithm of that
    quotient).

    \[ idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \]

    where

    |D|: total number of documents in the corpus

    |\{d : t_i \in d\}|: number of documents where the term t_i appears (that
    is, n_i != 0)

    There are different ways of calculating the IDF.

    1) Document frequency (DF): determine how many documents contain the word
       and divide that by the total number of documents in the collection.

       E.g. if the word "computer" appears in 1,000 documents out of a total
       of 10,000,000, then the DF is 0.0001 (1,000 / 10,000,000).

    2) Alternatively, take the logarithm of the inverse of the document
       frequency. The natural logarithm is commonly used. In this example we
       would have

       IDF = ln(10,000,000 / 1,000) = ln(10,000) ≈ 9.21

    The final TF-IDF score is then calculated by dividing the term frequency
    by the document frequency (first method), or by multiplying the term
    frequency by the logarithmic IDF (second method).

    E.g.

    The TF-IDF score for "computer" in the collection would be:

    1) TF-IDF = 0.03 / 0.0001 = 300, using the first method.

    2) TF-IDF = 0.03 * 9.21 ≈ 0.28, using the second method.
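
    The worked example can be reproduced with the sketch below. It assumes
    the IDF is the natural logarithm of the total number of documents divided
    by the number of documents containing the term, as in the definition of
    inverse document frequency given earlier; the function and variable names
    are illustrative only.

```python
import math


def document_frequency(docs_with_term, total_docs):
    """Fraction of documents in the collection that contain the term."""
    return docs_with_term / total_docs


def inverse_document_frequency(docs_with_term, total_docs):
    """Natural log of (total documents / documents containing the term)."""
    return math.log(total_docs / docs_with_term)


tf = 0.03                                             # 3 occurrences in 100 words
df = document_frequency(1_000, 10_000_000)            # 0.0001
idf = inverse_document_frequency(1_000, 10_000_000)   # ln(10,000), about 9.21

score_ratio = tf / df   # TF divided by DF, about 300
score_log = tf * idf    # TF multiplied by the logarithmic IDF, about 0.276
```

    The two scores are on very different scales, so scores are only
    comparable between documents when the same variant is used throughout.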

    OTHER TYPES OF SEARCHING TECHNIQUES:

    Directories

    Human editors comprehensively check the website and rank it, based on the
    information they find, using a pre-defined set of rules.

    There are two major directories:

    Yahoo Directory (www.yahoo.com)

    Open Directory (www.dmoz.org)

    Hybrid Search Engines

    Hybrid search engines use a combination of both crawler-based results and
    directory results.

    Examples of hybrid search engines are:

    Yahoo (www.yahoo.com)

    Google (www.google.com)

    Meta Search Engines

    Also known as Multiple Search Engines or Metacrawlers.

    Meta search engines query several other Web search engine

    databases in parallel and then combine the results in one list.

    Examples of Meta search engines include:

    Metacrawler (www.metacrawler.com)

    Dogpile (www.dogpile.com)
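
    The idea of querying several engines in parallel and combining the
    results in one list can be sketched as follows. The two engine functions
    are hypothetical stand-ins that return fixed lists, and interleaving by
    rank with duplicate removal is just one possible merging rule, not the
    method of any particular meta search engine.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for remote search engines: each returns a ranked URL list.
def engine_one(query):
    return ["http://a.example/", "http://b.example/", "http://c.example/"]


def engine_two(query):
    return ["http://b.example/", "http://d.example/"]


ENGINES = [engine_one, engine_two]


def meta_search(query):
    """Query every engine in parallel, then merge the ranked lists into one."""
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), ENGINES))

    merged, seen = [], set()
    # Interleave by rank (all first results, then all second results, and so
    # on), skipping URLs that an earlier engine already contributed.
    for rank in range(max(len(results) for results in result_lists)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


combined = meta_search("search engine")
```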

