Clickstream Analysis Using Hadoop

  • 7/25/2019 Clickstream Analysis Using Hadoop

    1/16

    Click Stream Analysis using Hadoop

    ClickStream Analysis

    A click stream is a record of a user's interactions with a website or other computer application. Each row of the click stream contains a timestamp and an indication of what the user did. Every click or other action is logged (hence the term "click stream"). This is useful when the website does different things for different users, such as posting recommendations.

    Data

    Data is obtained from the site in the form of clickstream records. Each record holds the details of a visitor's clicks and contains the following fields:

    Server IP

    Client IP

    Time stamp with date

    URL visited

    No. of bytes transferred

    Custom record(s)

    The country of origin for a specific request is identified using the IP address.

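    The record fields listed above can be sketched as a small parser. This is only an illustration, not the code used for the deck: the tab-separated layout, the field order, the timestamp format, and the names ClickRecord and parse_record are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ClickRecord:
    server_ip: str
    client_ip: str
    timestamp: datetime
    url: str
    bytes_transferred: int
    custom: list  # any custom record fields that follow the fixed ones


def parse_record(line: str) -> ClickRecord:
    """Parse one tab-separated clickstream line (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    server_ip, client_ip, stamp, url, size = fields[:5]
    return ClickRecord(
        server_ip=server_ip,
        client_ip=client_ip,
        timestamp=datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        url=url,
        bytes_transferred=int(size),
        custom=fields[5:],
    )


# Hypothetical sample line, using documentation-reserved IP addresses.
sample = "10.0.0.1\t203.0.113.7\t2019-07-25 10:15:00\thttp://www.usa.gov/\t512\tref=home"
record = parse_record(sample)
print(record.client_ip, record.timestamp.month, record.bytes_transferred)
```

    Resolving client_ip to a country would additionally need an IP-to-country lookup table, which is not shown here.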

    Methodology

    In order to run our algorithm on a larger data set, we must understand big data and Hadoop.

    What is Big data?

    A dataset that has the following three characteristics can be called big data:

    Volume - of the order of terabytes to petabytes

    Variety - Structured - relational database

              Unstructured - text, pdf, word, image

              Semi-structured - xml, log files

    Velocity - how fast data arrives (e.g. thousands of tweets per second)

    To handle such a large amount of data, the best solution is a distributed approach, where the system can process all types of data in a distributed manner.

    Google's solution - divide the task, assign it to many computers, collect the results, integrate them and form the final result. This is called the Map-Reduce algorithm.
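
    The divide / assign / collect / integrate steps can be sketched in a few lines. This is a single-machine toy, assuming that threads stand in for the many computers and that the task is counting clicks per URL; the function names and sample data are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(urls):
    """Work done by one worker: count clicks per URL in its chunk."""
    return Counter(urls)


def map_reduce_counts(urls, workers=3):
    # Divide: split the input into roughly equal chunks, one per worker.
    size = max(1, -(-len(urls) // workers))  # ceiling division
    chunks = [urls[i:i + size] for i in range(0, len(urls), size)]
    # Assign and collect: threads stand in for the cluster's computers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Integrate: merge the partial counts into the final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


clicks = ["a.gov", "b.gov", "a.gov", "c.gov", "a.gov", "b.gov", "a.gov"]
result = map_reduce_counts(clicks)
print(result.most_common(1))  # [('a.gov', 4)]
```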

    HDFS (Hadoop Distributed File System)

    This uses a block size of 64 MB or 128 MB (recommended). It may even use a block size …

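    To make the idea of a block size concrete, the sketch below computes how many blocks a file would occupy. The 128 MB figure is assumed for the example (block size is configurable per cluster), and block_count is a hypothetical helper, not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size of 128 MB, in bytes


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold a file of the given size."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division: a partly filled last block still counts.
    return (file_size_bytes + block_size - 1) // block_size


one_gb = 1024 ** 3
print(block_count(one_gb))      # 1 GB / 128 MB = 8 blocks
print(block_count(one_gb + 1))  # one extra byte needs a 9th block
```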

    Hadoop Architecture

    Hadoop follows a master-slave architecture.

    NameNode: This acts as the master and so is the most vital component of Hadoop. It is the bookkeeper of HDFS: it records how files are broken into blocks, which nodes contain those blocks, etc.

    DataNode: This acts as a slave. There are many slave nodes in a Hadoop cluster, unlike the NameNode, of which there is one per cluster. A DataNode can directly access its local file system and perform read/write operations.
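
    The bookkeeping role of the NameNode can be pictured with a toy model. This is a deliberately simplified sketch: the class ToyNameNode, the round-robin placement, and the node names are assumptions for illustration, and it ignores the block replication that real HDFS performs.

```python
from itertools import cycle


class ToyNameNode:
    """Master: remembers which DataNode holds each block of each file."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}  # filename -> list of (block_index, datanode)
        self._next_node = cycle(self.datanodes)

    def add_file(self, filename, num_blocks):
        # Assumed placement policy: simple round-robin, no replication.
        self.block_map[filename] = [
            (index, next(self._next_node)) for index in range(num_blocks)
        ]

    def locate(self, filename):
        """Answer a client's question: where are this file's blocks?"""
        return self.block_map[filename]


namenode = ToyNameNode(["datanode-1", "datanode-2", "datanode-3"])
namenode.add_file("clicks.log", num_blocks=4)
print(namenode.locate("clicks.log"))
```

    Only the metadata lives on the master in this model; the block contents themselves would be read from and written to the DataNodes.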

    Secondary NameNode (SNN): It takes snapshots of the HDFS metadata at intervals and communicates with the NameNode so as to minimize …

    Simulation Result

    The objective of this simulation is to collect click stream data from USA Government websites, which is high in volume and velocity, and store it for analysis in a cost-effective manner for enhanced insight and decision making.

    Most Clicked Websites

    Website Visited/Country

    Website Visited/Month

    Conclusion

    We need to create our own mapper class and reducer class for clickstream analysis, which can then be applied to business models.
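
    A custom mapper and reducer for click counting might look roughly like the sketch below. Hadoop's native API is Java, with Mapper and Reducer classes; plain Python functions are used here only to show the logic. The tab-separated layout with the URL in the fourth field, and the sample log lines, are assumptions for the example.

```python
from itertools import groupby


def mapper(lines):
    """Emit (url, 1) for every clickstream line (assumed tab-separated,
    with the URL in the fourth field)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            yield fields[3], 1


def reducer(pairs):
    """Sum the counts for each key. The pairs must arrive sorted by key,
    which is what the sort-and-shuffle phase provides."""
    for url, group in groupby(pairs, key=lambda pair: pair[0]):
        yield url, sum(count for _, count in group)


# Hypothetical sample input.
log = [
    "10.0.0.1\t203.0.113.7\t2019-07-25 10:15:00\thttp://www.usa.gov/\t512",
    "10.0.0.1\t198.51.100.2\t2019-07-25 10:16:00\thttp://www.nasa.gov/\t2048",
    "10.0.0.1\t203.0.113.7\t2019-07-25 10:17:00\thttp://www.usa.gov/\t300",
]
shuffled = sorted(mapper(log))  # stands in for Hadoop's sort-and-shuffle
for url, clicks in reducer(shuffled):
    print(url, clicks)
```

    Emitting a different key from the mapper, such as the visitor's country or the month of the timestamp, would give per-country or per-month counts with the same reducer.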

    Future Work

    OpenTracker Clickstream Analysis Tool

    o An interactive tool that lets you see all the visitors on your site in real-time, both online and offline.

    o Every visitor will be represented by an icon.

    o If you click on any visitor's icon, you will see a graphic representation of their clickstream. You will also see that visitor's profile, which consists of their country of origin, their ISP, technical specs, the frequency of visits they have made to your site and search terms that they might have used.

    o You will also know if they are a first-time visitor, and can view the details of their visit, i.e. the times they entered and left.

    References

    http://hadoop.apache.org/

    http://www.usa.gov/About/developer-resources/1usagovt.shtml

    http://www.cloudera.com/content/cloudera/en/about/hadoop-and-big-data.html

    Thank You