Keyword Searching: Weighted Federated Search with Key Word in Context

Date: 10/2/2008

Dan McCreary, President
Dan McCreary & Associates
dan@danmccreary.com
(952) 931-9198

Metadata Solutions

Copyright 2008 Dan McCreary & Associates

Acknowledgements

• Joe Wicentowski wrote the original keyword search examples

• Joe’s work was based on the KWIC code done by Wolfgang Meier

Note About Example and Functions

• In an actual production system the code would be modularized into a series of functions

• This example has the functions intentionally removed to make the process easier to view

• A functionalized version will also be available for students to use in their production applications

Motivation

• You have a large, complex web site with many heterogeneous data collections: people, blogs, news stories, event calendars, etc.

• Want a single search function that will find any item in any of these collections

• Each item has a different:
– Collection
– Title
– Item viewer function

Heterogeneous Items in a Collection

• Search results come back as heterogeneous items in a sequence

• Each hit item has a different structure

• Each hit item has a document type, and the title is consistently at the same XPath expression for each item type

[Figure: a sequence of hit items of different types (person, blog, country); each hit item contains a title element (t).]

Detailed Steps

• Gather search keywords

• Construct scope (collections)

• Execute query (generate hits)

• Score and sort

• Prepare summary results for top hits

• Display top results

Basic Search Algorithm

let $q := get-parameter("q", "")

for $hit in $collection-list/type[contains(., $q)]
return $hit

1. Get the search query
2. Find the documents that match; the [ ] predicate is like the SQL WHERE clause
3. Return a short summary of the matching documents

(pseudo-code)
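
The three steps above can be sketched outside the database as well. The following is a minimal Python illustration, not the eXist-db code used in the rest of this deck. The COLLECTIONS dictionary and the basic_search name are hypothetical stand-ins, and the match is made case-insensitive purely for convenience.

```python
# Hypothetical in-memory stand-in for the searchable collections.
COLLECTIONS = {
    "articles": ["XQuery makes federated search simple", "Gardening in spring"],
    "people": ["Joe writes XQuery search code"],
}


def basic_search(query, collections):
    """Return (collection name, text) for every item that contains the query."""
    needle = query.lower()
    return [
        (name, text)
        for name, items in collections.items()
        for text in items
        if needle in text.lower()  # plays the role of the [ ] predicate
    ]


# Step 1: get the query. Step 2: filter. Step 3: return the matching items.
hits = basic_search("xquery", COLLECTIONS)
```

Each hit keeps the name of the collection it came from, which is what later lets a federated search pick the right title and viewer for each item type.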

Collection Paths and Predicate

for $hit in (collection('/db/test/articles')/article/body,
             collection('/db/test/people')/person/biography)[. &= $q]

In a production system the list of collections would be stored in an XML file, and a function would return a sequence of the collections.

Sample HTML Search Form

<html>
  <head>
    <title>Keyword Search</title>
  </head>
  <body>
    <h1>Keyword Search</h1>
    <form method="GET" action="search.xq">
      <p>
        <strong>Keyword Search:</strong>
        <input name="q" type="text"/>
      </p>
      <p>
        <input type="submit" value="Search"/>
      </p>
    </form>
  </body>
</html>

The action attribute gives the path to the XQuery REST service that your form uses.

Protection against injection attacks

let $q := xs:string(request:get-parameter("q", ""))

(: hyphens are escaped so they match literally instead of forming character ranges :)
let $filtered-q :=
    replace($q,
        "[&amp;&quot;\-*;\-`~!@#$%^*()_+=\[\]\{\}\|';:/.,?(:]",
        "")

This removes from the input query any special characters that could be used in an injection attack.
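
The same filtering idea can be sketched in Python. This is an illustration only: the SPECIAL_CHARS list and the filter_query name are assumptions that mirror the spirit of the pattern above, not a vetted security filter. Building the character class with re.escape keeps every character literal, so a hyphen can never be read as a range.

```python
import re

# Characters stripped from the query. This list is an assumption made for
# the sketch; adjust it to the query syntax you actually need to protect.
SPECIAL_CHARS = "&\"-*;`~!@#$%^()_+=[]{}|':/.,?"
SPECIAL_RE = re.compile("[" + re.escape(SPECIAL_CHARS) + "]")


def filter_query(raw):
    """Drop special characters, then collapse runs of whitespace."""
    cleaned = SPECIAL_RE.sub("", raw)
    return " ".join(cleaned.split())


filtered = filter_query("XQuery'); drop-collection(\"/db\") (: Search")
```

Letters, digits and spaces pass through untouched, so ordinary keywords such as "XQuery" keep their capitalization.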

Create a Scope Sequence

let $scope := (collection('/db/test/articles')/article/body,
               collection('/db/test/people')/people/person/biography)

A scope is the list of all the items that you will query against.

Note that in the future this "inline" scope variable will usually be replaced with a function, xrx:get-searchable-collections(), that returns all searchable collections.

Scoring Each Hit

let $keyword-matches := text:match-count($hit)

let $hit-node-length := string-length($hit)

let $score := $keyword-matches div $hit-node-length

text:match-count() returns the number of keyword matches within a hit. If a document has five occurrences of the keywords, the match count is 5.

Once you have the sequence of hits, you can score each hit and return a new sequence of the top-scoring hits.

In the example above the score is the number of matches within the hit divided by the total length of the hit (in this case the number of characters in the hit node).

Score and Sort

let $sorted-hits :=
    for $hit in $hits
    let $keyword-matches := text:match-count($hit)
    let $hit-node-length := string-length($hit)
    let $score := $keyword-matches div $hit-node-length
    order by $score descending
    return $hit

Once you have the sequence of hits, you can score each of the hits and return a list of the top-scoring hits.
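
To see why dividing by the length matters, here is a small Python sketch of the same score. It is an illustration under stated assumptions: the score and sort_hits names are hypothetical, and counting whole-word matches with a regular expression is a simple stand-in for text:match-count(), not a description of how eXist-db counts matches.

```python
import re


def score(text, keywords):
    """Keyword matches divided by the length of the text in characters."""
    if not text:
        return 0.0
    wanted = {k.lower() for k in keywords}
    words = re.findall(r"\w+", text.lower())
    matches = sum(1 for w in words if w in wanted)
    return matches / len(text)


def sort_hits(hits, keywords):
    """Highest score first, like 'order by $score descending'."""
    return sorted(hits, key=lambda h: score(h, keywords), reverse=True)


hits = [
    "A long article that mentions search once among many other unrelated words",
    "search search",
    "keyword search tips",
]
ranked = sort_hits(hits, ["search"])
```

The short hit with two matches ranks first and the long article with a single match ranks last, because the same match count is spread over far more text.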

Result Pagination

let $perpage := xs:integer(request:get-parameter("perpage", "10"))

let $start := xs:integer(request:get-parameter("start", "0"))

let $end := $start + $perpage

let $results := for $hit in $sorted-hits[position() = ($start + 1 to $end)]

The remainder of our example deals with iterating through the results N records at a time, where N is the number of results per page ($perpage).

In this case $perpage and $start are both optional parameters to our search query.

$end is the sum of the start and the number per page.

Because XQuery positions start at 1 and $start is a zero-based offset, the predicate [position() = ($start + 1 to $end)] selects the current page. It is the same as calling subsequence($sorted-hits, $start + 1, $perpage) on the sorted hits to get the final $results sequence to display on the page.
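
The paging arithmetic can be tried out with a short Python sketch. The page_of name and the sample hit labels are hypothetical. Note one difference from XQuery: Python slices are zero-based and exclude the end index, so a zero-based start offset maps directly onto the slice.

```python
def page_of(sorted_hits, start=0, perpage=10):
    """Return one page of hits; start is a zero-based offset into the list."""
    end = start + perpage
    return sorted_hits[start:end]  # a slice past the end is simply shorter


sorted_hits = [f"hit-{n}" for n in range(1, 26)]  # 25 hypothetical hits
first_page = page_of(sorted_hits)                 # hit-1 through hit-10
third_page = page_of(sorted_hits, start=20)       # hit-21 through hit-25
```

A "next page" link only needs to add perpage to the current start value and pass both parameters back to the search script.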

Showing Results

• With Highlighted Keyword in Context

• We want to show each result as an HTML div element containing 3 components:
– The document title
– A summary with an excerpt of the hit showing the keywords highlighted in context
– A link to display the full document

Extracting the Collection and Document

let $collection := util:collection-name($hit)

let $document := util:document-name($hit)

We did not need to keep track of the original collection and document that each hit came from, because we can always find them using these two functions.

KWIC Functions

• let $summary := kwic:summarize($hit, $config)

Displaying the Keyword in Context

The word or words you used in your search should be highlighted in the context of the search results. You can customize how much of the surrounding text you want to display.
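
The idea of a keyword in context can be shown with a few lines of Python. This is a sketch of the concept only and does not reproduce the behavior of kwic:summarize(); the kwic function name, the width parameter and the ** highlight markers are all assumptions made for the illustration.

```python
import re


def kwic(text, keyword, width=20):
    """Return one snippet per match, with the keyword wrapped in ** markers."""
    snippets = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]   # context before
        right = text[m.end():m.end() + width]              # context after
        snippets.append(f"...{left}**{m.group(0)}**{right}...")
    return snippets


text = "Federated search lets one search box cover every collection."
snippets = kwic(text, "search", width=10)
# snippets[0] is "...Federated **search** lets one ..."
```

Raising or lowering width controls how much surrounding text is shown around each highlighted keyword.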

Calculating number of pages

let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $total-result-count := count($hits)
let $end :=
    if ($total-result-count lt $perpage)
    then $total-result-count
    else $start + $perpage
let $number-of-pages := xs:integer(ceiling($total-result-count div $perpage))
let $current-page := xs:integer(($start + $perpage) div $perpage)
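
A worked example makes the arithmetic easier to check. The Python sketch below follows the same formulas, with one labeled variation: it clamps the end position with min() on every page, which is an assumption of this sketch rather than what the code above does. The page_info name is hypothetical.

```python
import math


def page_info(total, start=0, perpage=10):
    """Return (end, number_of_pages, current_page) for a zero-based start."""
    end = min(start + perpage, total)       # never run past the last hit
    number_of_pages = math.ceil(total / perpage)
    current_page = start // perpage + 1     # same as (start + perpage) div perpage
    return end, number_of_pages, current_page


# 25 hits, 10 per page: three pages, and start=20 is the third page.
last = page_info(25, start=20)
```

With 25 hits and 10 per page, ceiling(25 div 10) gives 3 pages, and a start of 20 gives (20 + 10) div 10 = 3 for the current page.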

Managing Federated Search

• Each application you use needs to communicate the following items to the federated search tool:
– Collection name
– Collection data path
– Collection document path
– Collection title path
– Collection id path
– Collection viewer path

Sample App Config File

<app-info>
  <app-name>Articles</app-name>
  <app-path>/db/test/articles</app-path>
  <doc-path>article/body</doc-path>
  <doc-title-path>article/title/text()</doc-title-path>
  <doc-id>article/id/text()</doc-id>
  <doc-viewer>/db/test/articles/views/view-article.xq?id=</doc-viewer>
</app-info>

If you create a file called app-info.xml in each collection that you want to search, you can dynamically create a list of applications to search. Doing this lets you automate the installation of interoperable applications.
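
As a rough illustration of how such a file could drive the search scope, the Python sketch below reads one app-info document and turns it into a scope entry. The load_app_info and build_scope names are hypothetical, and a real federated search would collect one app-info.xml per collection from the database instead of using an inline string.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample config; in practice this would be read from
# the app-info.xml file stored in each searchable collection.
APP_INFO = """
<app-info>
  <app-name>Articles</app-name>
  <app-path>/db/test/articles</app-path>
  <doc-path>article/body</doc-path>
  <doc-title-path>article/title/text()</doc-title-path>
  <doc-id>article/id/text()</doc-id>
  <doc-viewer>/db/test/articles/views/view-article.xq?id=</doc-viewer>
</app-info>
"""


def load_app_info(xml_text):
    """Map each child element name of <app-info> to its text value."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


def build_scope(app_infos):
    """One (collection path, document path) pair per registered application."""
    return [(info["app-path"], info["doc-path"]) for info in app_infos]


articles = load_app_info(APP_INFO)
scope = build_scope([articles])
```

Adding a new searchable application then means dropping one more config file into its collection, with no change to the search code itself.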

Thank You!

Please contact me for more information:
• Native XML Databases
• Metadata Management
• Metadata Registries
• Service Oriented Architectures
• Business Intelligence and Data Warehouse
• Semantic Web

Dan McCreary, President
Dan McCreary & Associates
Metadata Strategy
dan@danmccreary.com
(952) 931-9198