CSE 8337 SPRING 2011 PROJECT 3


Page 1: CSE 8337   SPRING 2011  PROJECT 3

CSE 8337 SPRING 2011 PROJECT 3

Richa Arora

Page 2: CSE 8337   SPRING 2011  PROJECT 3

Agenda Tool Identified and Overview Schema.xml Tokenization, Stop words, and Synonym Handling Indexing Data Import Handler Query format and Matching documents to query Function Queries Bibliography

Page 3: CSE 8337   SPRING 2011  PROJECT 3

TOOL IDENTIFIED & OVERVIEW

Page 4: CSE 8337   SPRING 2011  PROJECT 3

Tool Identified & Overview

SOLR - Open source enterprise search platform from the Apache Lucene project

Purpose
◦ To implement full-text search functionality in a web application

Commercial websites using SOLR
◦ www.digg.com
◦ http://www.whitehouse.gov/ - Uses SOLR via Drupal for site search with highlighting and faceting
◦ http://beta.fcc.gov/
◦ http://www.netflix.com/

Page 5: CSE 8337   SPRING 2011  PROJECT 3

SOLR Application

[Diagram; labels: Web server, Database server, Web Application, SOLR, Document, Database]

Page 6: CSE 8337   SPRING 2011  PROJECT 3

Features and Technology

Features
◦ Full text search
◦ Rich document handling (including MS Word, PDF, RTF, etc.)
◦ HTML administration interface
◦ Scalable

Technology
◦ Java programming language
◦ Lucene Java search library
◦ Runs as a search server within a servlet container such as Tomcat or Jetty

Page 7: CSE 8337   SPRING 2011  PROJECT 3

Functioning of SOLR

[Diagram; labels: Documents, Browser-based web interface, Solr Server, Documents for indexing, Search Queries, Search Results, Indexing, Searching, Index, schema.xml, solrconfig.xml]

Page 8: CSE 8337   SPRING 2011  PROJECT 3

Operations in SOLR

Documents form the basic unit of SOLR
Documents are composed of fields
Examples:
◦ Document for Person: Fields – name, height, age, etc.
◦ Document for Recipes: Fields – origin, ingredients, etc.

Documents are fed to SOLR
SOLR extracts the information from the fields in the documents and makes it searchable
Steps:
◦ Field Analysis
◦ Tokenization
◦ Filter application
◦ Indexing

Page 9: CSE 8337   SPRING 2011  PROJECT 3

SCHEMA.XML

Page 10: CSE 8337   SPRING 2011  PROJECT 3

schema.xml

Governs how SOLR builds indexes from input documents
Defines the field types and the specific fields that documents can contain
Describes how SOLR should handle the fields when adding documents to the index or when querying those fields

Page 11: CSE 8337   SPRING 2011  PROJECT 3

Elements of schema.xml

<schema>
  <types>
  <fields>
  <uniqueKey>
  <defaultSearchField>
  <solrQueryParser defaultOperator>
  <copyField>
</schema>

Page 12: CSE 8337   SPRING 2011  PROJECT 3

Analyzers

These examine the text of fields and generate a token stream
Indexing Analyzers: The results of the analysis are added to an index, and a set of terms (with attributes such as positions and sizes) is defined for a field
Querying Analyzers: The values being searched for are analyzed, and the resulting terms are matched against those stored in the field's index

<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Page 13: CSE 8337   SPRING 2011  PROJECT 3

TOKENIZATION, STOP WORDS, AND SYNONYM HANDLING

Page 14: CSE 8337   SPRING 2011  PROJECT 3

Tokenization

Splits a stream of text into tokens
Tokens are subsequences of the characters in the text
A token carries various metadata in addition to its text value, such as the location at which the token occurs in the field

<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

Example
◦ Standard Tokenizer: Treats whitespace and punctuation as delimiters and discards them
  Input: "Email: [email protected]"
  Output: "Email", "[email protected]"
◦ N-Gram Tokenizer: Reads the field text and generates n-gram tokens of sizes in the given range (default minimum is 1 and maximum is 2)
  Input: "hello world"
  Output: "h", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d", "he", "el", "ll", "lo", "o ", " w", "wo", "or", "rl", "ld"
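The n-gram behaviour described above can be illustrated with a small Python sketch. This is not SOLR code. The function name ngram_tokens is hypothetical, and the sketch assumes that all grams of one size are emitted before the next size, as in the "hello world" example.

```python
def ngram_tokens(text, min_size=1, max_size=2):
    """Return every character n-gram of text, smaller sizes first."""
    tokens = []
    for size in range(min_size, max_size + 1):
        # Slide a window of the current size across the whole text.
        for start in range(len(text) - size + 1):
            tokens.append(text[start:start + size])
    return tokens


tokens = ngram_tokens("hello world")
print(tokens[:11])  # the single characters, including the space
print(tokens[11:])  # the two-character grams, including "o " and " w"
```

Because the space is an ordinary character here, grams that span the word boundary are produced as well.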

Page 15: CSE 8337   SPRING 2011  PROJECT 3

Filters

Filters take tokens as input from the Tokenizers and produce another stream of tokens as output
Multiple filters can be used one after the other
Example:

<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>
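The idea of a tokenizer feeding a chain of filters can be sketched in Python as plain function composition. All names below are hypothetical stand-ins: tokenize only splits on non-alphanumeric characters, and plural_filter is a toy replacement for a real stemmer such as the Porter filter. Neither reproduces SOLR's actual classes.

```python
import re


def tokenize(text):
    """Toy tokenizer: split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def plural_filter(tokens):
    """Toy stand-in for a stemmer: drop one trailing 's' from longer words."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, filters):
    """Run the tokenizer, then apply each filter to the previous output."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


result = analyze("Searching Documents, Fields", [lowercase_filter, plural_filter])
print(result)
```

Each filter sees only the output of the one before it, which is why the order of the filter elements in the analyzer definition matters.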

Page 16: CSE 8337   SPRING 2011  PROJECT 3

Types of Filters

Page 17: CSE 8337   SPRING 2011  PROJECT 3

Result of Filter application

Page 18: CSE 8337   SPRING 2011  PROJECT 3

Stop Words Handling

Stop Filter: This filter discards tokens that are on the given stop words list. A standard stop words list for English language text is included in the SOLR config directory, named stopwords.txt

Example: Using the standard stopwords.txt

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>

Tokenizer Input: "welcome to the world of Solr"
Tokenizer Output/Filter Input: "welcome"(1), "to"(2), "the"(3), "world"(4), "of"(5), "Solr"(6)
Filter Output: "welcome"(1), "world"(2), "Solr"(3)
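A minimal Python sketch of the stop filter step follows. The function name stop_filter is hypothetical, and STOP_WORDS is an assumed three-word excerpt rather than the full stopwords.txt. The surviving tokens are renumbered consecutively, as in the example above.

```python
def stop_filter(tokens, stop_words):
    """Drop stop words and number the surviving tokens from 1."""
    kept = [t for t in tokens if t.lower() not in stop_words]
    return [(t, position) for position, t in enumerate(kept, start=1)]


# Assumed excerpt of a stop word list; the real file contains more entries.
STOP_WORDS = {"to", "the", "of"}

tokens = "welcome to the world of Solr".split()
filtered = stop_filter(tokens, STOP_WORDS)
print(filtered)
```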

Page 19: CSE 8337   SPRING 2011  PROJECT 3

stopwords.txt

Page 20: CSE 8337   SPRING 2011  PROJECT 3

Synonym Handling

Synonym Filter: This is used for finding synonyms at indexing time as well as at query time. Tokens are looked up in the list of synonyms, and if a match is found, the synonyms are put in place of the token

Example: We can define the synonyms in a file (test_synonyms.txt) and use it for comparing the tokens
◦ home, dwelling, house
◦ shop => workshop, store
◦ teh => the

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="test_synonyms.txt"/>
</analyzer>

Tokenizer Input: "teh home shop"
Tokenizer Output/Filter Input: "teh"(1), "home"(2), "shop"(3)
Filter Output: "the"(1), "home"(2), "dwelling"(2), "house"(2), "workshop"(3), "store"(3)
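The lookup can be sketched in Python under one common reading of the rule syntax: a comma-separated line lists equivalent terms that expand to the whole group, and a line with => replaces each term on the left with the terms on the right. The function names are hypothetical and the sketch handles single-word terms only.

```python
def parse_rules(lines):
    """Map each input term to the list of terms that replace it."""
    rules = {}
    for line in lines:
        if "=>" in line:
            left, right = line.split("=>")
            targets = [t.strip() for t in right.split(",")]
            for source in left.split(","):
                rules[source.strip()] = targets
        else:
            # Equivalent terms: each one expands to the whole group.
            group = [t.strip() for t in line.split(",")]
            for term in group:
                rules[term] = group
    return rules


def synonym_filter(tokens, rules):
    """Emit replacements at the position of the token they replace."""
    output = []
    for position, token in enumerate(tokens, start=1):
        for term in rules.get(token, [token]):
            output.append((term, position))
    return output


RULES = parse_rules(["home, dwelling, house", "shop => workshop, store", "teh => the"])
print(synonym_filter(["teh", "home", "shop"], RULES))
```

All synonyms of a token share that token's position, so phrase and proximity matching still line up after expansion.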

Page 21: CSE 8337   SPRING 2011  PROJECT 3

INDEXING

Page 22: CSE 8337   SPRING 2011  PROJECT 3

Indexing

Refers to adding content to a SOLR index to make the content searchable
Sources of data for indexing:
◦ XML
◦ CSV
◦ Rich text formats (PDF, MS Word, MS Excel, text, etc.)
◦ Data extracted from tables in a database

Page 23: CSE 8337   SPRING 2011  PROJECT 3

Posting Data to SOLR

Uploading Data with SOLR Cell
◦ Using ExtractingRequestHandler
◦ With a POST
◦ With SOLR Cell and SOLRJ

Uploading Data with Index Handlers
◦ XMLUpdateRequestHandler for XML-formatted Data
◦ Using the CSVRequestHandler for CSV Content
◦ Indexing Using SOLRJ

Uploading Structured Data Store Data with the Data Import Handler

Content Streams

Page 24: CSE 8337   SPRING 2011  PROJECT 3

cURL Utility

curl posts and retrieves data over HTTP, FTP, and many other protocols
In the example below, the Extraction Request Handler is called; it uploads the file tutorial.html and assigns it the unique ID doc1

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "[email protected]"

literal.id provides a unique ID for the document uploaded to SOLR
commit=true makes the document searchable after indexing
The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the uploading of binary files
The @ symbol instructs curl to upload the attached file
The argument [email protected] needs a valid file path
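The query string of the extract request can also be assembled programmatically. The Python sketch below only builds the URL and does not contact a server or upload a file. The helper name extract_url is hypothetical, and the parameter values are the ones from the curl example.

```python
from urllib.parse import urlencode


def extract_url(base, doc_id, commit=True):
    """Build the URL for an extract request with the example's parameters."""
    params = {
        "literal.id": doc_id,            # unique ID assigned to the document
        "uprefix": "attr_",              # prefix for fields not in the schema
        "fmap.content": "attr_content",  # map the extracted body to this field
        "commit": "true" if commit else "false",
    }
    return base + "/update/extract?" + urlencode(params)


url = extract_url("http://localhost:8983/solr", "doc1")
print(url)
```

Building the query string with urlencode keeps any special characters in an ID properly escaped.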

Page 25: CSE 8337   SPRING 2011  PROJECT 3

Example – XMLUpdateRequestHandler

Order of operation:
1. Modify the schema.xml file to add any fields that do not already exist in it, for example: authors, dd, isbn, yearpub, publisher
2. Modify the schema.xml file to copy the newly created fields to the text field so that their contents can be found by searches on that field
3. Run the curl utility with the command for adding an XML document:

curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary "<add><doc><field name='id'>doc26</field><field name='authors'>Patrick Eagar</field><field name='subject'>Sports</field><field name='dd'>796.35</field><field name='isbn'>0002166313</field><field name='yearpub'>1982</field><field name='publisher'>Collins</field></doc><commit waitFlush='false' waitSearcher='false'/></add>"
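Writing the add message by hand is error-prone once field values contain characters such as & or <. As a sketch, the same kind of message can be generated with Python's standard xml.etree.ElementTree module. The helper name build_add_xml is hypothetical, the field values are those from the example above, and the commit element is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return an <add><doc>...</doc></add> message for one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # special characters are escaped on serialization
    return ET.tostring(add, encoding="unicode")


payload = build_add_xml([
    ("id", "doc26"),
    ("authors", "Patrick Eagar"),
    ("subject", "Sports"),
    ("dd", "796.35"),
    ("isbn", "0002166313"),
    ("yearpub", "1982"),
    ("publisher", "Collins"),
])
print(payload)
```

The resulting string could then be sent as the request body of the update call.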

Page 26: CSE 8337   SPRING 2011  PROJECT 3

Uploading Structured Data Store Data with the Data Import Handler

Often data is stored in relational databases
Data Import Handler (DIH) provides a mechanism to import data from a database and to index it
DIH can also index content from RSS and ATOM feeds, e-mail repositories, and structured XML

Page 27: CSE 8337   SPRING 2011  PROJECT 3

Configuration

Handler to be registered in the solrconfig.xml file

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">${solr.config.dir:./solr/conf}/dataimporthandler/data-config.xml</str>
  </lst>
</requestHandler>

There can be multiple configuration files

Page 28: CSE 8337   SPRING 2011  PROJECT 3

DIH Example

1. Create a database in SQL Server 2005
2. The tables and the relationships in the database are shown below

Page 29: CSE 8337   SPRING 2011  PROJECT 3

DIH Example

3. Create an XML file called DIH_Test.xml for importing into SOLR
4. Modify the solrconfig.xml file to instruct SOLR to import data as per the file DIH_Test.xml

Page 30: CSE 8337   SPRING 2011  PROJECT 3

DIH Example

5. Do a full-import of the DIH from the browser using:

http://localhost:8983/solr/dataimport?command=full-import

Page 31: CSE 8337   SPRING 2011  PROJECT 3

DIH Example

7. Run queries on the newly indexed data from the database
8. Example: http://localhost:8983/solr/select?q=ipad2

The above query returns the result. Executing queries on the original database returns similar results

Page 32: CSE 8337   SPRING 2011  PROJECT 3

DIH Example – Multiple Datasources

Page 33: CSE 8337   SPRING 2011  PROJECT 3

QUERY FORMAT AND MATCHING DOCUMENT TO A QUERY

Page 34: CSE 8337   SPRING 2011  PROJECT 3

Searching in SOLR

[Diagram; labels: Request Handler, Query Parser, Index, Response Writer]

qt: selects a Request Handler for a query sent to /select

defType: selects a query parser for the query

qf: selects which fields to query in the index

start: specifies an offset into the query results where the returned response should begin

rows: specifies the number of rows (documents) to be returned at a time

fq: filters the query by applying an additional query to the initial query's results; the filter results are cached

wt: selects a response writer for formatting the query response
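As a sketch of how these parameters combine into a single request, the Python snippet below builds a select URL. The helper name select_url is hypothetical, the filter on popularity is only an illustrative value, and nothing is sent to a server.

```python
from urllib.parse import urlencode


def select_url(base, q, **params):
    """Build a /select URL from the main query and any extra parameters."""
    query = {"q": q}
    query.update(params)
    return base + "/select?" + urlencode(query)


url = select_url(
    "http://localhost:8983/solr",
    "ipad2",
    fq="popularity:6",  # filter applied on top of the main query
    start=0,            # offset of the first returned result
    rows=10,            # number of results per page
    wt="json",          # response writer
)
print(url)
```

Paging through results then amounts to increasing start by rows on each request.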

Page 35: CSE 8337   SPRING 2011  PROJECT 3

Query Syntax and Parsing - The Standard Query Parser

Advantage - Enables the user to specify very precise queries
Disadvantage – Is less tolerant of syntax errors than the DisMax query parser

Parameters Supported
◦ Terms – Use of wildcard characters, fuzzy searches, boosts, and ranges
◦ Fields – Identified by name followed by a colon
◦ Boolean Operators – AND, OR, NOT, &&, !, ||
◦ Common query parameters – debugQuery, defType, explainOther, fl, fq, omitHeader, rows, sort, start, timeAllowed
◦ Functions – abs, constant, div, fieldValue, log, linear, max, etc.
◦ Faceting
◦ Highlighting
◦ MoreLikeThis (mlt)

Page 36: CSE 8337   SPRING 2011  PROJECT 3

Standard Query Parser Parameters

q – Defines a query using standard query syntax. This parameter is mandatory

q.op – Specifies the default operator for query expressions, overriding the default operator defined in schema.xml. Possible values are "AND" or "OR"

df – Specifies a default field, overriding the definition of a default field in schema.xml

Default parameter values are specified in solrconfig.xml

Page 37: CSE 8337   SPRING 2011  PROJECT 3

Sample Responses - Example

Query
http://localhost:8983/solr/select?q=id:6H500F0&popularity=6

Page 38: CSE 8337   SPRING 2011  PROJECT 3

Term Modifiers – To add flexibility and precision

Fuzzy Searches - based on the Levenshtein Distance or Edit Distance
E.g. tight~ will match terms like flight, slight, etc.
An additional parameter specifies the required degree of similarity – tight~0.8 will match sight. When set closer to 1, this optional parameter causes only terms with higher similarity to be matched
If the numerical parameter is omitted, the default value of 0.5 is used
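The Levenshtein (edit) distance behind fuzzy matching counts the single-character insertions, deletions, and substitutions needed to turn one term into another. The Python sketch below computes only that distance. How SOLR converts it into the 0 to 1 similarity value is not covered here, and the function name is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


for term in ("sight", "flight", "slight"):
    print(term, edit_distance("tight", term))
```

The term sight is one edit away from tight, while flight and slight are two edits away, which is why a stricter similarity setting keeps only the former.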

Page 39: CSE 8337   SPRING 2011  PROJECT 3

Term Modifiers – To add flexibility and precision

Range Searches
◦ Specifies a range (with an upper and lower bound) of values for a field
◦ Can be inclusive or exclusive of the lower and upper bounds

Query:
http://localhost:8983/solr/select?q=popularity:{5 TO 7}
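In the Lucene query syntax, square brackets denote an inclusive range and curly braces an exclusive one. The Python sketch below mimics that rule for numeric values only. The function name in_range is hypothetical, and the check is a simplification of what the query parser does.

```python
def in_range(expr, value):
    """Check a number against '[low TO high]' or '{low TO high}'."""
    inclusive = expr.startswith("[") and expr.endswith("]")
    exclusive = expr.startswith("{") and expr.endswith("}")
    if not (inclusive or exclusive):
        raise ValueError("not a range expression: " + expr)
    low, high = (float(part) for part in expr[1:-1].split(" TO "))
    if inclusive:
        return low <= value <= high  # bounds themselves match
    return low < value < high        # bounds themselves do not match


print(in_range("{5 TO 7}", 6), in_range("{5 TO 7}", 5), in_range("[5 TO 7]", 5))
```

Under this reading, popularity:{5 TO 7} matches only documents whose popularity lies strictly between 5 and 7.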

Page 40: CSE 8337   SPRING 2011  PROJECT 3

Common Query Parameters

Parameter: Description

defType: Query parser to be used (DisMax or Standard Query Parser)

sort: Sorts the response to a query in asc or desc order based on the response's score or another characteristic

start: Offset into the responses at which SOLR should begin displaying content

rows: Number of rows of responses displayed at a time

fq: Filter query for search results

fl: Limits responses to a listed set of fields

Page 41: CSE 8337   SPRING 2011  PROJECT 3

Common Query Parameters

Parameter: Description

debugQuery: Includes debugging information in the response

timeAllowed: Time allowed for a query to be processed. If the time elapses before the response is complete, partial information is returned

omitHeader: Excludes header information from returned results

wt: Specifies the response writer

Page 42: CSE 8337   SPRING 2011  PROJECT 3

Function Queries

Used to generate a relevancy score using the actual value of one or more numeric fields

Functions available for function queries
◦ abs – abs(x); abs(-5)
◦ constant – 1.5; _val_:1.5
◦ div – div(1,y); div(sum(x,100), max(y,1))
◦ linear – linear(x, m, c); linear(x, 2, 4) returns 2*x+4
◦ log – log(x); log(sum(x,100))
◦ …

Including a function query in a SOLR query
◦ With the _val_ keyword – e.g. _val_:myNumericField
◦ As a parameter with an explicit type of FunctionQuery (e.g. the DisMax query parser's bf parameter)
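To make the arithmetic of these functions concrete, the Python sketch below evaluates two of the listed expressions for a single document. It is plain Python, not SOLR code. The helper names are hypothetical, and x and y stand for the numeric field values of one document.

```python
def linear(x, m, c):
    """linear(x, m, c) evaluates m*x + c."""
    return m * x + c


def div(x, y):
    """div(x, y) divides x by y."""
    return x / y


def nested_example(x, y):
    """Value of div(sum(x,100), max(y,1)) for field values x and y."""
    return div(sum([x, 100]), max(y, 1))


print(linear(3, 2, 4))        # 2*3 + 4
print(nested_example(50, 0))  # max(y,1) guards against division by zero
```

Wrapping the divisor in max(y,1) is a common way to keep documents with a zero value from breaking the score.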

Page 43: CSE 8337   SPRING 2011  PROJECT 3

Function Query - Example

http://localhost:8983/solr/select/?q=cat:electronics+_val_:"div(price,weight)"&fl=*,score

Page 44: CSE 8337   SPRING 2011  PROJECT 3

Response Writers

Generate a formatted response to a search
The wt parameter sets the response writer
Response writers supported
◦ json
◦ php
◦ phps
◦ python
◦ ruby
◦ xml
◦ xslt
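A client usually picks the response writer that is easiest to parse in its own language. The Python sketch below parses a JSON response of the general shape returned with wt=json. SAMPLE is a hand-written illustration, not captured server output, and the function name doc_ids is hypothetical.

```python
import json

# Hand-written sample; a real response would carry more header fields.
SAMPLE = """
{"responseHeader": {"status": 0, "QTime": 1},
 "response": {"numFound": 1, "start": 0,
              "docs": [{"id": "6H500F0", "popularity": 6}]}}
"""


def doc_ids(raw):
    """Return the id of every document in a JSON search response."""
    data = json.loads(raw)
    return [doc["id"] for doc in data["response"]["docs"]]


ids = doc_ids(SAMPLE)
print(ids)
```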

Page 45: CSE 8337   SPRING 2011  PROJECT 3

Bibliography

Apache SOLR Wiki, http://wiki.apache.org/solr/FrontPage (link last accessed on 04/25/2011)

Lucid Works SOLR Reference Guide 1.4, http://www.lucidimagination.com/user_download/certified/cdrg/lucidworks-solr-refguide-1.4.pdf (link last accessed on 04/25/2011)