Extensible Information Retrieval with Apache Nutch Aaron Elkiss 16-Feb-2006.

Extensible Information Retrieval with Apache Nutch

Aaron Elkiss

16-Feb-2006

Why use Nutch?

• Front-end to large collections of documents

• Demonstrate research without writing lots of extra code

Outline

• Nutch - information retrieval
  – Pros & Cons
  – Crawling the Local Filesystem
  – How Nutch Works
  – Indexing a Database
  – Query Filters: Searching with Nutch

Nutch

• Open source search engine

• Written in Java

• Built on top of Apache Lucene

Advantages of Nutch

• Scalable
  – Index local host or entire Internet

• Portable
  – Runs anywhere with Java

• Flexible
  – Plugin system + API

• Code pretty easy to read & work with

• Better than implementing it yourself!

Disadvantages of Nutch

• Documentation still somewhat lacking

• Not yet fully mature

• No GUI

• Odd Tomcat setup

• Several “gotchas”

Crawling the Local Filesystem

• Step 1: Create list of files to index

file_list:

/data0/projects/clairlib/CLAIR/aleClairlib.pl
/data0/projects/clairlib/CLAIR/buildALE.pl
/data0/projects/clairlib/CLAIR/get_cosine_example.pl
/data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
/data0/projects/clairlib/CLAIR/makeCorpus.pl
/data0/projects/clairlib/CLAIR/normalize_cosines.pl
/data0/projects/clairlib/CLAIR/queryALE.pl
/data0/projects/clairlib/CLAIR/testCluster.pl
/data0/projects/clairlib/CLAIR/testCorpusDownload.pl
/data0/projects/clairlib/CLAIR/testDocument.pl
/data0/projects/clairlib/CLAIR/testDocumentPair.pl
/data0/projects/clairlib/CLAIR/testIP.pl
/data0/projects/clairlib/CLAIR/testUtil.pl
/data0/projects/clairlib/CLAIR/testWebSearch.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
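A list like file_list above could be generated rather than typed by hand. The following is a minimal sketch, not part of Nutch: the helper names build_file_list and write_file_list and the ".pl" suffix default are assumptions made for illustration.

```python
import os


def build_file_list(root, suffix=".pl"):
    """Return sorted absolute paths of files under root that end in suffix."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


def write_file_list(paths, out_path):
    """Write one path per line, the same layout as file_list above."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
```

For example, write_file_list(build_file_list("/data0/projects/clairlib/CLAIR"), "file_list") would walk the directory used in the slide and write the result to a file named file_list.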


• Step 2: Edit Configuration
  – crawl-urlfilter.txt
    • Very restrictive by default
    • Must allow file: URLs

crawl-urlfilter.txt default

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.

crawl-urlfilter.txt

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# allow everything else
+.
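The first-match rule described in the filter comments can be illustrated with a short Python sketch. This is not Nutch's implementation. It assumes an unanchored regular-expression search and Python's `re` syntax, which agrees with the patterns shown here, and it uses a shortened suffix list. The names load_rules and accepts are hypothetical.

```python
import re


def load_rules(text):
    """Parse filter text into (accept, regex) pairs, skipping comments and blanks."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching rule decides; a URL that matches no rule is ignored."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


# Abbreviated version of the edited filter file (suffix list shortened).
RULES = load_rules(r"""
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG)$
# allow everything else
+.
""")

print(accepts(RULES, "file:/data0/projects/clairlib/CLAIR/testUtil.pl"))  # True
print(accepts(RULES, "file:/data0/projects/logo.png"))                    # False
```

Under the same assumptions, feeding the default file's rule `-^(file|ftp|mailto):` to load_rules rejects every file: URL, which is why the default has to be edited before a local crawl.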


• Step 3: Edit Configuration
  – nutch-site.xml (overrides nutch-default.xml)
    • Enable protocol-file plugin and parse plugins

<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded. In any
    case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins.
    </description>
  </property>
</nutch-conf>


• Step 4: Run the crawl
  – bin/nutch crawl myurls

• Step 5: Start Tomcat
  – GOTCHA: must start in the crawl directory!
  – Or edit WEB-INF/classes/nutch-site.xml

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
  </property>
</nutch-conf>

Modifying the Results Page

• Just customize search.jsp!

• For example, display external ‘citations’ link instead of ‘anchors’

(<a href="../explain.jsp?<%=id%>&query=<%=URLEncoder.encode(queryString)%>"><i18n:message key="explain"/></a>)

(<a href="http://oriole.eecs.umich.edu/cgi-bin/citations.pl?<%=url%>">citations</a>)

<%-- (<a href="../anchors.jsp?<%=id%>"><i18n:message key="anchors"/></a>) --%>

How Nutch Works

• Protocol plugin

URL -> Protocol.getProtocolOutput -> Content

Content
  – byte[] content
  – String contentType
  – URL url
  – Properties metadata

How Nutch Works

• Parsing plugins

URL -> Protocol.getProtocolOutput -> Content -> Parser.getParse -> Parse

Content
  – byte[] content
  – String contentType
  – URL url
  – Properties metadata

Parse
  – String text
  – ParseData data

ParseData
  – Properties metadata
  – Outlink[] outlinks
  – String title
  – ParseStatus status
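The two-stage flow above (protocol plugin produces Content, parser turns Content into a Parse) can be mimicked in a few lines of Python. This is only an analogy to show how the data moves, not the Nutch Java API: the dataclasses mirror the field names on the slide, while get_protocol_output, get_parse, the in-memory store, and the "first line is the title" rule are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    content: bytes
    content_type: str
    url: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ParseData:
    title: str
    outlinks: list
    metadata: dict
    status: str


@dataclass
class Parse:
    text: str
    data: ParseData


def get_protocol_output(url, store):
    """Protocol step: fetch raw bytes for a URL (here from an in-memory dict)."""
    return Content(content=store[url], content_type="text/plain", url=url)


def get_parse(content):
    """Parse step: decode the bytes and use the first line as the title."""
    text = content.content.decode("utf-8")
    title = text.splitlines()[0] if text else ""
    data = ParseData(title=title, outlinks=[],
                     metadata=dict(content.metadata), status="success")
    return Parse(text=text, data=data)
```

Calling get_parse(get_protocol_output(url, store)) chains the two steps in the same order as the diagram.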

Indexing a Database

• Need to write a new plugin

• Luckily interface is pretty simple

• Much less tightly coupled than full-text search inside database

Indexing a Database

• Approach
  – Get the text out
  – Generate a 1:1 mapping from URLs to documents in the database
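One way to picture the 1:1 mapping is to derive each URL from a row's primary key and invert that mapping at fetch time. The sketch below uses Python's sqlite3 with an in-memory table. The docs table, the http://db.example/docs/ URL prefix, and the function names are all hypothetical; a real plugin would be written in Java against Nutch's protocol interface.

```python
import sqlite3
from urllib.parse import urlparse


def url_for(doc_id, base="http://db.example/docs/"):
    """Map a primary key to its unique URL (hypothetical URL scheme)."""
    return base + str(doc_id)


def doc_id_for(url):
    """Inverse of url_for: the last path segment is the primary key."""
    return int(urlparse(url).path.rsplit("/", 1)[-1])


def fetch_text(conn, url):
    """Turn a URL request into a database request; None if no such row."""
    row = conn.execute("SELECT body FROM docs WHERE id = ?",
                       (doc_id_for(url),)).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [(1, "first document"), (2, "second document")])

# One URL per row: this list plays the role of the crawl seed list.
seed_urls = [url_for(row[0])
             for row in conn.execute("SELECT id FROM docs ORDER BY id")]
```

Because url_for and doc_id_for are inverses, every row has exactly one URL and every such URL resolves to exactly one row.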

Indexing a Database

• Protocol plugin
  – Replaces default ‘http’ plugin
  – Converts http request to database request

Indexing a Database

• Parse plugin
  – Replaces text or HTML parser
  – Protocol plugin gets the text and metadata, so don’t need to do much here

Indexing a Database

• Configuration - plugin.xml

Indexing a Database

• Configuration - nutch-site.xml
  – Add correct plugin

• Make sure Nutch can find plugin
  – $NUTCH_HOME/plugins

Improving the Plugin

• Configuration via XML

• Determine which database to use for what URLs

• Automatically ‘crawl’ database

• Pass unknown URLs to default plugin

Searching with Nutch

• Parse query - NutchAnalysis

• Filter query - QueryFilters

• Pass to Lucene - IndexSearcher
  – Optimization/caching - LuceneQueryOptimizer
  – Translate hits from Lucene back to Nutch

Query Filter

Nutch Query -> QueryFilter.filter() -> Lucene Query

Date Query Filter

• Date query filter restricts by date

Basic Query Filter

• Boosts weight of particular fields

• Manipulates phrases
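Field boosting can be pictured as expanding each query term into one clause per indexed field, each with its own weight. The sketch below renders the result as a Lucene-style query string (`field:term^boost`) purely for illustration; it is not how the Nutch filter is implemented. The field names and boost values in FIELD_BOOSTS are assumptions, and terms are assumed to be plain words that need no escaping.

```python
# Hypothetical fields and weights; real values come from the Nutch configuration.
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "title": 1.5, "content": 1.0}


def expand_term(term, boosts=FIELD_BOOSTS):
    """Let one term match in any field, weighted by that field's boost."""
    clauses = ["%s:%s^%s" % (name, term, boost) for name, boost in boosts.items()]
    return "(" + " OR ".join(clauses) + ")"


def expand_query(terms, boosts=FIELD_BOOSTS):
    """Require every term ('+'), each of which may match in any boosted field."""
    return " ".join("+" + expand_term(term, boosts) for term in terms)


print(expand_query(["nutch"]))
# +(url:nutch^4.0 OR anchor:nutch^2.0 OR title:nutch^1.5 OR content:nutch^1.0)
```

A match in a highly boosted field such as the URL then contributes more to the score than the same match in the body text.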

Additional Query Filters

• Could implement relevance feedback in this framework

• Manual relevance feedback
  – could add morelike:somedocument operator

• Automatic relevance feedback - extend BasicQueryFilter

Additional Capabilities

• Distributed searching
  – Nutch Distributed File System

• MapReduce a la Google

• More

Nutch Distributed Filesystem

• Write-once

• Stream-oriented (append-only, sequential read)

• Distributed, transparent, replicated, fault-tolerant

• Distribute index and content

MapReduce

• Distributed processing technique

• Idea from functional programming

Map

• Apply same operation to several data items

• Example (Python):

def getDocument(docid):
    """ fetch document with given docid from database """
    # do some stuff ...
    return document

docids = [1, 2, 3, 4, 5]

documents = map(getDocument, docids)

• Mapping for individual items is independent - distributable!
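Because each call depends only on its own input, the same map can be handed to a pool of workers without changing the result. The sketch below uses a thread pool from Python's standard library as a stand-in for a cluster; FAKE_DB and get_document are hypothetical placeholders for a real database lookup.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the database; in practice each lookup would be a remote call.
FAKE_DB = {1: "alpha", 2: "beta", 3: "gamma", 4: "delta", 5: "epsilon"}


def get_document(docid):
    """Fetch one document; it needs nothing from any other call."""
    return FAKE_DB[docid]


docids = [1, 2, 3, 4, 5]

# Sequential map, as on the slide.
sequential = list(map(get_document, docids))

# Same map spread over three workers; results come back in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(get_document, docids))
```

Both lists hold the same documents in the same order, whichever worker handled each item.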

Reduce

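• Combine results of map operation

• Simple example - sum of squares

from functools import reduce  # reduce is not a builtin in Python 3

measurements = [4, 2, 6, 9]

def add(x, y):
    return x + y

def square(x):
    return x ** 2  # ** is exponentiation; ^ is bitwise XOR in Python

result = reduce(add, map(square, measurements))  # 16 + 4 + 36 + 81 = 137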

• Can use to distribute crawling, indexing, etc
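As a rough illustration of how indexing fits this pattern, the sketch below builds a tiny inverted index in three phases: map each document to (term, doc id) pairs, group the pairs by term, and reduce each group to a posting list. It runs in a single process and is not Nutch's MapReduce code. The function names and the whitespace tokenizer are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce


def map_doc(doc_id, text):
    """Map: emit a (term, doc_id) pair for each distinct term in one document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(doc_ids):
    """Reduce: merge doc ids into a sorted posting list without duplicates."""
    return reduce(lambda acc, d: acc if d in acc else sorted(acc + [d]),
                  doc_ids, [])


def build_index(docs):
    """Run map, shuffle, and reduce over a {doc_id: text} dictionary."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_doc(doc_id, text)]
    return {term: reduce_postings(ids) for term, ids in shuffle(pairs).items()}


index = build_index({1: "nutch search", 2: "lucene search"})
# index == {"nutch": [1], "search": [1, 2], "lucene": [2]}
```

The map calls are independent per document and the reduce calls are independent per term, so both phases could be spread across machines.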

MapReduce in Nutch

Conclusions

• Nutch is
  – featureful
  – flexible
  – extensible
  – scalable

• Get started with Nutch: http://lucene.apache.org/nutch

• Sample plugins and code samples: http://umich.edu/~aelkiss/nutch
