Extensible Information Retrieval with Apache Nutch Aaron Elkiss 16-Feb-2006.
Extensible Information Retrieval with Apache Nutch
Aaron Elkiss
16-Feb-2006
Why use Nutch?
• Front-end to large collections of documents
• Demonstrate research without writing lots of extra code
Outline
• Nutch - information retrieval
  – Pros & Cons
  – Crawling the Local Filesystem
  – How Nutch Works
  – Indexing a Database
  – Query Filters: Searching with Nutch
Nutch
• Open source search engine
• Written in Java
• Built on top of Apache Lucene
Advantages of Nutch
• Scalable
  – Index local host or entire Internet
• Portable
  – Runs anywhere with Java
• Flexible
  – Plugin system + API
• Code pretty easy to read & work with
• Better than implementing it yourself!
Disadvantages of Nutch
• Documentation still somewhat lacking
• Not yet fully mature
• No GUI
• Odd Tomcat setup
• Several “gotchas”
Crawling the Local Filesystem
• Step 1: Create list of files to index

file_list:

    /data0/projects/clairlib/CLAIR/aleClairlib.pl
    /data0/projects/clairlib/CLAIR/buildALE.pl
    /data0/projects/clairlib/CLAIR/get_cosine_example.pl
    /data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
    /data0/projects/clairlib/CLAIR/makeCorpus.pl
    /data0/projects/clairlib/CLAIR/normalize_cosines.pl
    /data0/projects/clairlib/CLAIR/queryALE.pl
    /data0/projects/clairlib/CLAIR/testCluster.pl
    /data0/projects/clairlib/CLAIR/testCorpusDownload.pl
    /data0/projects/clairlib/CLAIR/testDocument.pl
    /data0/projects/clairlib/CLAIR/testDocumentPair.pl
    /data0/projects/clairlib/CLAIR/testIP.pl
    /data0/projects/clairlib/CLAIR/testUtil.pl
    /data0/projects/clairlib/CLAIR/testWebSearch.pl
    /data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
    /data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
    /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
    /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
    /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
    /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
Crawling the Local Filesystem
• Step 2: Edit Configuration
  – crawl-urlfilter.txt
• Very restrictive by default
• Must allow file: URLs

crawl-urlfilter.txt (default):

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.

    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    # skip everything else
    -.
crawl-urlfilter.txt (edited to allow file: URLs):

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

    # allow everything else
    +.
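The first-match-wins rule described in the file's header comment can be sketched in a few lines of Python. This is only an illustration of the rule as the comment states it, not Nutch's own (Java) filter code; the function names `load_rules` and `accepts` are made up for this example, and matching anywhere in the URL (`search`) is an assumption.

```python
import re

def load_rules(text):
    """Parse filter lines into (include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first matching pattern decides; no match means the URL is ignored."""
    for include, regex in rules:
        if regex.search(url):  # assumption: a match anywhere in the URL counts
            return include
    return False

# A shortened version of the edited rule set, for illustration only.
RULES = load_rules(r"""
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG)$
# allow everything else
+.
""")

print(accepts(RULES, "file:/data0/projects/clairlib/CLAIR/makeCorpus.pl"))
print(accepts(RULES, "file:/data0/logo.gif"))
```

With these two rules the Perl script is accepted and the GIF is rejected; with the default rule `-^(file|ftp|mailto):` placed first, every file: URL would be rejected before any later rule is consulted.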
Crawling the Local Filesystem
• Step 3: Edit Configuration
  – nutch-site.xml (overrides nutch-default.xml)
• Enable protocol-file plugin and parse plugins

    <nutch-conf>
      <property>
        <name>plugin.includes</name>
        <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
        <description>Regular expression naming plugin directory names to
        include. Any plugin not matching this expression is excluded.
        In any case you need at least include the nutch-extensionpoints plugin.
        By default Nutch includes crawling just HTML and plain text via HTTP,
        and basic indexing and search plugins.
        </description>
      </property>
    </nutch-conf>
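To see which plugins a given plugin.includes value would select, the expression can be tried against a list of directory names. The sketch below does that in Python; it assumes the expression must match the whole directory name, and the list of installed plugins is invented for the example.

```python
import re

# The value from the nutch-site.xml shown on the slide.
PLUGIN_INCLUDES = (
    r"nutch-extensionpoints|protocol-file|urlfilter-regex|"
    r"parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)"
)

def enabled_plugins(directory_names, includes=PLUGIN_INCLUDES):
    """Return the directory names that the include expression matches."""
    pattern = re.compile(includes)
    # Assumption: the whole directory name has to match the expression.
    return [name for name in directory_names if pattern.fullmatch(name)]

# Hypothetical contents of the plugins directory.
installed = [
    "nutch-extensionpoints",
    "protocol-http",
    "protocol-file",
    "parse-html",
    "parse-pdf",
    "query-more",
]
print(enabled_plugins(installed))
```

Here protocol-http and query-more are left out, which is the point of the edit: the HTTP protocol plugin is swapped for protocol-file.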
Crawling the Local Filesystem
• Step 4: Run the crawl
  – bin/nutch crawl myurls
• Step 5: Start Tomcat
  – GOTCHA: must start in the crawl directory!
  – Or edit WEB-INF/classes/nutch-site.xml

    <nutch-conf>
      <property>
        <name>searcher.dir</name>
        <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
      </property>
    </nutch-conf>
Modifying the Results Page
• Just customize search.jsp!
• For example, display external ‘citations’ link instead of ‘anchors’
(<a href="../explain.jsp?<%=id%>&query=<%=URLEncoder.encode(queryString)%>"><i18n:message key="explain"/></a>)
(<a href="http://oriole.eecs.umich.edu/cgi-bin/citations.pl?<%=url%>">citations</a>)
<%-- (<a href="../anchors.jsp?<%=id%>"><i18n:message key="anchors"/></a>) --%>
How Nutch Works
• Protocol plugin
[Diagram: URL → Protocol.getProtocolOutput → Content]
Content fields:
  – byte[] content
  – String contentType
  – URL url
  – Properties metadata
How Nutch Works
• Parsing plugins
[Diagram: URL → Protocol.getProtocolOutput → Content → Parser.getParse → Parse]
Content fields:
  – byte[] content
  – String contentType
  – URL url
  – Properties metadata
Parse fields:
  – String text
  – ParseData data
ParseData fields:
  – Properties metadata
  – Outlink[] outlinks
  – String title
  – ParseStatus status
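The two-stage flow on this slide (a protocol step that turns a URL into Content, then a parse step that turns Content into a Parse) can be mimicked in Python. The field names follow the slide; the classes and functions are stand-ins written for this sketch and are not the Nutch Java API. The parser here simply treats the first line as the title, which is an assumption made to keep the example small.

```python
from dataclasses import dataclass, field

@dataclass
class Content:
    content: bytes          # raw bytes fetched by the protocol step
    content_type: str
    url: str
    metadata: dict = field(default_factory=dict)

@dataclass
class ParseData:
    title: str
    outlinks: list
    status: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Parse:
    text: str
    data: ParseData

def get_protocol_output(url):
    """Stand-in protocol step: read the local file named by a file: URL."""
    path = url[len("file:"):]
    with open(path, "rb") as handle:
        raw = handle.read()
    return Content(raw, "text/plain", url)

def get_parse(content):
    """Stand-in parse step: decode the bytes and use the first line as title."""
    text = content.content.decode("utf-8", errors="replace")
    title = text.splitlines()[0] if text else ""
    return Parse(text, ParseData(title=title, outlinks=[], status="success"))

sample = Content(b"Nutch notes\nPlugins are loaded by name.", "text/plain", "file:/tmp/notes.txt")
parsed = get_parse(sample)
print(parsed.data.title)
```

The parse step only ever sees a Content object, so a different protocol step can be swapped in without touching the parser.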
Indexing a Database
• Need to write a new plugin
• Luckily interface is pretty simple
• Much less tightly coupled than full-text search inside database
Indexing a Database
• Approach
  – Get the text out
  – Generate a 1:1 mapping from URLs to documents in the database
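One way to get a 1:1 mapping is to embed the primary key in the URL, so each direction is a pure function of the other. The sketch below assumes integer keys and uses an invented base URL (`http://db.example/documents/`); neither comes from the talk.

```python
from urllib.parse import urlparse

# Hypothetical base URL; any stable prefix would do.
BASE = "http://db.example/documents/"

def url_for(doc_id):
    """Build the URL that stands for one database row."""
    return f"{BASE}{doc_id}"

def doc_id_for(url):
    """Recover the row id from a URL, rejecting anything outside the mapping."""
    parsed, prefix = urlparse(url), urlparse(BASE)
    if parsed.netloc != prefix.netloc or not parsed.path.startswith(prefix.path):
        raise ValueError("not a database URL: " + url)
    tail = parsed.path[len(prefix.path):]
    if not tail.isdigit():
        raise ValueError("no document id in URL: " + url)
    return int(tail)

print(url_for(42))
print(doc_id_for("http://db.example/documents/42"))
```

Because `doc_id_for(url_for(n))` returns `n` for every id, the crawler can be seeded with generated URLs and the plugin can always find its way back to the row.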
Indexing a Database
• Protocol plugin
  – Replaces default ‘http’ plugin
  – Converts HTTP request to database request
Indexing a Database
• Parse plugin
  – Replaces text or HTML parser
  – Protocol plugin gets the text and metadata, so don’t need to do much here
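The division of labour described here, where the protocol step does the database lookup and hands back both text and metadata, might look like the following in Python. The table layout (`documents` with `id`, `title`, `body`) and the `fetch` function are assumptions for the sketch, and SQLite stands in for whatever database is actually being indexed.

```python
import sqlite3

def fetch(connection, doc_id):
    """Turn a document id into content bytes plus metadata, or None if absent."""
    row = connection.execute(
        "SELECT title, body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return None
    title, body = row
    return {
        "content": body.encode("utf-8"),
        "content_type": "text/plain",
        "metadata": {"title": title},  # already extracted, so parsing is trivial
    }

# Hypothetical in-memory database with one document.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
)
connection.execute(
    "INSERT INTO documents VALUES (1, 'Nutch notes', 'Plugins are loaded by name.')"
)

print(fetch(connection, 1)["metadata"])
```

Since the title arrives as metadata, the parse step has nothing left to extract beyond passing the text through.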
Indexing a Database
• Configuration - plugin.xml
Indexing a Database
• Configuration - nutch-site.xml
  – Add correct plugin
• Make sure Nutch can find plugin
  – $NUTCH_HOME/plugins
Improving the Plugin
• Configuration via XML
• Determine which database to use for what URLs
• Automatically ‘crawl’ database
• Pass unknown URLs to default plugin
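Two of these improvements, choosing a database per URL and falling back to the default plugin, amount to a routing table. A minimal sketch, assuming URL-prefix patterns and with all database and handler names invented for the example:

```python
import re

# Hypothetical routing table: URL pattern -> database name.
ROUTES = [
    (re.compile(r"^http://db\.example/papers/"), "papers_db"),
    (re.compile(r"^http://db\.example/authors/"), "authors_db"),
]

def choose_handler(url, routes=ROUTES, default="default-http"):
    """Return the database for a known URL, else the default handler's name."""
    for pattern, database in routes:
        if pattern.match(url):
            return database
    return default  # unknown URLs go to the ordinary protocol plugin

print(choose_handler("http://db.example/papers/17"))
print(choose_handler("http://lucene.apache.org/nutch"))
```

In a real plugin the table would presumably be read from the XML configuration mentioned in the first bullet rather than hard-coded.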
Searching with Nutch
• Parse query - NutchAnalysis
• Filter query - QueryFilters
• Pass to Lucene - IndexSearcher
  – Optimization/caching - LuceneQueryOptimizer
  – Translate hits from Lucene back to Nutch
Query Filter
[Diagram: Nutch Query → QueryFilter.filter() → Lucene Query]
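The idea of a filter chain, where each filter looks at the parsed query and contributes clauses to the query handed to Lucene, can be sketched as follows. This is a toy model in Python, not the Nutch QueryFilter interface: the clause representation, the function names, and the use of a Lucene-style query string as the output are all assumptions made for illustration.

```python
def parse_query(text):
    """Split 'nutch site:apache.org' into (field, term) clauses."""
    clauses = []
    for token in text.split():
        field, sep, term = token.partition(":")
        clauses.append((field, term) if sep else ("DEFAULT", token))
    return clauses

def default_field_filter(clauses, output):
    """Handle plain terms by requiring them in the content field."""
    output.extend(f"+content:{term}" for field, term in clauses if field == "DEFAULT")
    return output

def site_filter(clauses, output):
    """Handle site: clauses by requiring the site field to match."""
    output.extend(f"+site:{term}" for field, term in clauses if field == "site")
    return output

FILTERS = [default_field_filter, site_filter]

def run_filters(clauses, filters=FILTERS):
    """Pass the same parsed query through every filter in turn."""
    output = []
    for query_filter in filters:
        output = query_filter(clauses, output)
    return " ".join(output)

print(run_filters(parse_query("nutch site:apache.org")))
```

Each filter ignores the clauses it does not understand, so supporting a new operator means adding one more function to the list.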
Date Query Filter
• Date query filter restricts by date
Basic Query Filter
• Boosts weight of particular fields
• Manipulates phrases
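Field boosting can be pictured as expanding one user term into a clause per field, each with its own weight, written here in Lucene's `field:term^boost` query syntax. The boost values below are illustrative numbers chosen for the sketch, not values taken from the talk.

```python
# Illustrative weights only: a match in the URL or anchor text counts for more.
FIELD_BOOSTS = {"url": 4.0, "anchor": 2.0, "title": 1.5, "content": 1.0}

def boosted_clause(term, boosts=FIELD_BOOSTS):
    """Expand one term into an OR over fields, each with its boost."""
    parts = [f"{field}:{term}^{boost}" for field, boost in boosts.items()]
    return "(" + " OR ".join(parts) + ")"

print(boosted_clause("nutch"))
```

A document that mentions the term in a heavily boosted field then scores higher than one that only mentions it in the body text.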
Additional Query Filters
• Could implement relevance feedback in this framework
• Manual relevance feedback
  – could add morelike:somedocument operator
• Automatic relevance feedback - extend BasicQueryFilter
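As a rough picture of automatic relevance feedback, the query can be expanded with the terms that occur most often in the top-ranked documents from a first search. The sketch below uses plain term counts and is a simplification chosen for brevity; it is not a method described in the talk, and `expand_query` is a made-up name.

```python
from collections import Counter

def expand_query(query_terms, top_documents, extra=2):
    """Add the most frequent new terms from the top documents to the query."""
    counts = Counter()
    for text in top_documents:
        counts.update(
            word for word in text.lower().split() if word not in query_terms
        )
    # Sort by descending count, then alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [word for word, _ in ranked[:extra]]

top = ["nutch crawler plugin", "nutch plugin index"]
print(expand_query(["nutch"], top, extra=1))
```

In the framework above, a filter of this kind would run a first search, collect the top documents, and emit the expanded clauses in place of the original ones.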
Additional Capabilities
• Distributed searching
  – Nutch Distributed File System
• MapReduce a la Google
• More
Nutch Distributed Filesystem
• Write-once
• Stream-oriented (append-only, sequential read)
• Distributed, transparent, replicated, fault-tolerant
• Distribute index and content
MapReduce
• Distributed processing technique
• Idea from functional programming
Map
• Apply same operation to several data items
• Example (Python):

    def getDocument(docid):
        """Fetch the document with the given docid from the database."""
        # do some stuff ...
        return document

    docids = [1, 2, 3, 4, 5]

    documents = map(getDocument, docids)
• Mapping for individual items is independent - distributable!
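Because each call is independent, the same map can be handed to a pool of workers without changing the function. A small sketch using Python's standard thread pool; `get_document` here is a placeholder that returns a string instead of querying a database.

```python
from concurrent.futures import ThreadPoolExecutor

def get_document(docid):
    """Stand-in for a database fetch; returns a placeholder string."""
    return f"document {docid}"

docids = [1, 2, 3, 4, 5]

# pool.map runs the calls on several workers but returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    documents = list(pool.map(get_document, docids))

print(documents)
```

The calling code looks the same as the sequential version; only the executor decides where each call runs.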
Reduce
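• Combine results of map operation
• Simple example - sum of squares

    from functools import reduce  # reduce is not a builtin in Python 3

    measurements = [4, 2, 6, 9]

    def add(x, y):
        return x + y

    def square(x):
        return x ** 2  # ** is exponentiation; ^ would be bitwise XOR

    result = reduce(add, map(square, measurements))  # 137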
• Can use to distribute crawling, indexing, etc
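Indexing fits the same pattern: the map step turns each document into (word, document id) pairs, and the reduce step merges those pairs into an inverted index. The following is a single-process sketch of that idea with two made-up documents; it does not show how Nutch itself distributes the work.

```python
from collections import defaultdict
from functools import reduce

def map_document(doc):
    """Map step: emit one (word, doc_id) pair per distinct word."""
    doc_id, text = doc
    return [(word, doc_id) for word in set(text.lower().split())]

def merge(index, pairs):
    """Reduce step: fold one document's pairs into the shared index."""
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return index

# Hypothetical two-document collection.
documents = [(1, "nutch builds on lucene"), (2, "lucene indexes text")]

index = reduce(merge, map(map_document, documents), defaultdict(set))
print(sorted(index["lucene"]))
```

The map step touches one document at a time, so it could run on many machines; only the merge needs to see results from more than one document.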
MapReduce in Nutch
Conclusions
• Nutch is
  – featureful
  – flexible
  – extensible
  – scalable
• Get started with Nutch: http://lucene.apache.org/nutch
• Sample plugins and code samples: http://umich.edu/~aelkiss/nutch