Nutch in a Nutshell

Transcript of Nutch in a Nutshell

    Nutch in a Nutshell

    Presented by Liew Guo Min and Zhao Jin

    Outline

    Recap

    Special features

    Running Nutch in a distributed environment (with demo)

    Q&A

    Discussion

    Recap

    Complete web search engine

    Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)

    Java-based, open source

    Features:

    Customizable

    Extensible

    Distributed

    Nutch as a crawler

    [Diagram: the Nutch crawl cycle. The Injector writes the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates a fetch list (a segment); the Fetcher gets the corresponding webpages/files from the web into the segment; the Parser parses the fetched content; and the CrawlDB tool updates the CrawlDB (read/write) with the results so the cycle can repeat.]

    Special Features: Extensible (Plugin system)

    Most of the essential functionality of Nutch is implemented as plugins

    Three layers:

    Extension points

    What can be extended: Protocol, Parser, ScoringFilter, etc.

    Extensions

    The interfaces to be implemented for the extension points

    Plugins

    The actual implementation
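
    To make the layers concrete, here is a minimal Java sketch: an interface that plays the role of the extension-point contract, and a class that a plugin would supply as the actual implementation. All names (org.example, ExampleFilter, LowercaseFilter) are hypothetical stand-ins, not Nutch's real API; in real Nutch the extension point would be one of the shipped interfaces such as Protocol, Parser or ScoringFilter.

        // org/example/ExampleFilter.java -- plays the role of an extension point:
        // an interface describing what can be extended.
        package org.example;

        public interface ExampleFilter {
            /** Return the (possibly rewritten) URL, or null to drop it. */
            String filter(String url);
        }

        // org/example/LowercaseFilter.java (second file) -- the plugin: the actual
        // implementation that the plugin system discovers via its plugin.xml descriptor.
        package org.example;

        public class LowercaseFilter implements ExampleFilter {
            @Override
            public String filter(String url) {
                return url == null ? null : url.toLowerCase();
            }
        }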

    Special Features: Extensible (Plugin system)

    Anyone can write a plugin

    Write the code

    Prepare metadata files

    plugin.xml: what has been extended by what (a rough sketch follows below)

    build.xml: how ant can build your source code

    Ask Nutch to include your plugin in conf/nutch-site.xml

    Tell ant to build your plugin in src/plugin/build.xml

    More details @

    http://wiki.apache.org/nutch/PluginCentral
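
    As a rough, hedged illustration of the first metadata file, a plugin.xml for the hypothetical LowercaseFilter sketched earlier might look roughly like the following. The element names follow the general shape of Nutch plugin descriptors, but the exact schema and the extension-point id depend on your Nutch version, so check PluginCentral before copying anything.

        <?xml version="1.0" encoding="UTF-8"?>
        <!-- Hypothetical descriptor: ids, class names and the extension point
             are placeholders; consult PluginCentral for the exact schema. -->
        <plugin id="filter-lowercase" name="Lowercase URL Filter" version="0.1.0">

           <runtime>
              <!-- the jar that this plugin's build.xml produces -->
              <library name="filter-lowercase.jar">
                 <export name="*"/>
              </library>
           </runtime>

           <!-- what has been extended (the point) and by what (the implementation) -->
           <extension id="org.example.filter.lowercase"
                      point="org.example.ExampleFilter">
              <implementation id="LowercaseFilter"
                              class="org.example.LowercaseFilter"/>
           </extension>

        </plugin>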

    Special Features: Extensible (Plugin system)

    To use a plugin

    Make sure you have modified nutch-site.xml to include the plugin

    Then, either

    Nutch will automatically call it when needed, or

    You can write something that loads it by its class name and then uses it (see the sketch below)
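
    A minimal sketch of the second option, using plain Java reflection to load the hypothetical plugin class from the earlier slides by its class name; the names are illustrative, not Nutch API, and reflection stands in for whatever lookup mechanism your own code uses.

        // org/example/FilterDriver.java
        package org.example;

        public class FilterDriver {
            public static void main(String[] args) throws Exception {
                // Load the plugin implementation by its class name, then use it directly.
                ExampleFilter filter = (ExampleFilter) Class
                        .forName("org.example.LowercaseFilter")
                        .getDeclaredConstructor()
                        .newInstance();
                System.out.println(filter.filter("http://EXAMPLE.org/Index.html"));
            }
        }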

    Special Features: Distributed (Hadoop)

    MapReduce (see the Excursion slides at the end for a worked diagram)

    A framework for distributed programming

    Map -- processes the splits of the input data to produce intermediate results, keyed to indicate what should be put together later

    Reduce -- processes the intermediate results that share the same key and outputs the final result
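
    As a sketch of the two steps, here is a small Hadoop job in Java that counts the occurrences of "cat" and "dog", mirroring the worked example in the Excursion slides at the end. It uses the current org.apache.hadoop.mapreduce API rather than the 2006-era API this talk was based on, so treat it as an illustration of the idea, not of Nutch's own code.

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class CatDogCount {

            // Map: for each line of a split, emit (word, 1) for every "cat" or "dog".
            // The framework splits the input and runs many mappers in parallel.
            public static class TokenMapper
                    extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);

                @Override
                protected void map(LongWritable offset, Text line, Context context)
                        throws IOException, InterruptedException {
                    for (String token : line.toString().split("\\s+")) {
                        if (token.equals("cat") || token.equals("dog")) {
                            context.write(new Text(token), ONE);  // the key says what belongs together
                        }
                    }
                }
            }

            // Reduce: all counts for the same key (word) arrive together; sum them.
            public static class SumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable c : counts) {
                        sum += c.get();
                    }
                    context.write(word, new IntWritable(sum));  // e.g. cat 500, dog 500
                }
            }
        }

    The sort/group step between the two phases is handled by the framework itself, which is exactly the "Sort/Group" box in the excursion diagrams.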

    Special Features: Distributed (Hadoop)

    MapReduce in Nutch

    Example 1: Parsing

    Input: the files from a fetch

    Map(url, content) by calling the parser plugins

    Reduce is the identity

    Example 2: Dumping a segment

    Input: the various files from a segment

    Map is the identity

    Reduce(url, value*) simply concatenates the text representations of the values

    Special Features: Distributed (Hadoop)

    Distributed file system

    Write-once-read-many coherence model

    High throughput

    Master/slave

    Simple architecture

    Single point of failure

    Transparent

    Access via the Java API

    More info @ http://lucene.apache.org/hadoop/hdfs_design.html
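
    A minimal sketch of the "access via the Java API" point using Hadoop's FileSystem class; the file path is a placeholder, and which file system you actually get depends on the fs.default.name setting described on the next slides.

        import java.io.BufferedReader;
        import java.io.InputStreamReader;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class DfsExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();  // picks up the *-site.xml files on the classpath
                FileSystem fs = FileSystem.get(conf);      // the DFS if fs.default.name points at a namenode

                Path file = new Path("/user/demo/hello.txt");  // placeholder path

                // Write once ...
                try (FSDataOutputStream out = fs.create(file)) {
                    out.writeBytes("hello dfs\n");
                }

                // ... read many times.
                try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
                    System.out.println(in.readLine());
                }
            }
        }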

    Running Nutch in a distributed environment

    MapReduce

    In hadoop-site.xml

    Specify job tracker host & port

    mapred.job.tracker

    Specify task numbers

    mapred.map.tasks

    mapred.reduce.tasks

    Specify location for temporary files

    mapred.local.dir
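
    For example, the MapReduce part of hadoop-site.xml might contain entries like the following; the host name, task counts and directory are placeholders to adapt to your own cluster, and the property names are the old Hadoop ones used in these slides.

        <configuration>
          <!-- job tracker host & port (placeholder host) -->
          <property>
            <name>mapred.job.tracker</name>
            <value>master.example.org:9001</value>
          </property>
          <!-- task numbers (tune to your cluster size) -->
          <property>
            <name>mapred.map.tasks</name>
            <value>4</value>
          </property>
          <property>
            <name>mapred.reduce.tasks</name>
            <value>2</value>
          </property>
          <!-- location for temporary files on each node -->
          <property>
            <name>mapred.local.dir</name>
            <value>/data/hadoop/mapred/local</value>
          </property>
        </configuration>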

    Running Nutch in a distributed environment

    DFS

    In hadoop-site.xml

    Specify namenode host, port & directory

    fs.default.name

    dfs.name.dir

    Specify location for files on each datanode

    dfs.data.dir
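
    And the DFS part of hadoop-site.xml might look like this; the host and directories are again placeholders, and depending on the Hadoop version fs.default.name takes either a plain host:port or an hdfs:// URI.

        <configuration>
          <!-- namenode host & port (placeholder) -->
          <property>
            <name>fs.default.name</name>
            <value>hdfs://master.example.org:9000</value>
          </property>
          <!-- where the namenode keeps its metadata -->
          <property>
            <name>dfs.name.dir</name>
            <value>/data/hadoop/dfs/name</value>
          </property>
          <!-- where each datanode stores its blocks -->
          <property>
            <name>dfs.data.dir</name>
            <value>/data/hadoop/dfs/data</value>
          </property>
        </configuration>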

    Demo time!

    Q&A

    Discussion

    Exercises

    Hands-on exercises

    Install Nutch, crawl a few webpages using the crawl command and perform a search on it using the GUI

    Repeat the crawling process without using the crawl command

    Modify your configuration to perform each of the following crawl jobs and think about when they would be useful:

    To crawl only webpages and PDFs but not anything else (see the nutch-site.xml sketch after this list)

    To crawl the files on your harddisk

    To crawl but not to parse

    (Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state
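
    For the first of those crawl jobs, one possible starting point is the plugin.includes property in conf/nutch-site.xml, which whitelists the plugins (and therefore the parsers) Nutch loads. The value below is only a sketch: the exact plugin ids differ between Nutch versions, so start from the list shipped in nutch-default.xml and trim it down.

        <configuration>
          <!-- Illustrative only: restrict parsing to HTML and PDF by listing
               only those parse plugins; ids are version-dependent. -->
          <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-regex|parse-(html|pdf)|index-basic|query-(basic|site|url)</value>
          </property>
        </configuration>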

    References

    http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins

    http://lucene.apache.org/hadoop/ -- Hadoop homepage

    http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki

    http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"

    http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"

    http://www.mail-archive.com/nutch-[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together

    Excursion: MapReduce

    Problem

    Find the number of occurrences of "cat" in a file

    What if the file is 20 GB large?

    Why not do it with more computers?

    Solution

    [Diagram: the file is divided into Split 1 and Split 2; PC1 counts 200 occurrences in its split and PC2 counts 300 in its split; PC1 then adds the two partial counts to get 500.]

    Excursion: MapReduce

    Problem

    Find the number of occurrences of both "cat" and "dog" in a very large file

    Solution

    [Diagram: the input file is divided into Split 1 and Split 2. In the Map step, PC1 emits "cat: 200, dog: 250" for Split 1 and PC2 emits "cat: 300, dog: 250" for Split 2 (the intermediate files). The intermediate results are then sorted/grouped by key into "cat: 200, 300" and "dog: 250, 250". In the Reduce step, PC1 sums the counts for cat (cat: 500) and PC2 sums the counts for dog (dog: 500), producing the output files.]

    Excursion: MapReduce

    Generalized framework

    [Diagram: a Master coordinates the workers. The input files are divided into splits (Split 1-4); each Map worker processes a split and emits intermediate key/value pairs (k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6). The intermediate files are sorted/grouped by key and handed to the Reduce workers, which produce the output files (Output 1-3).]