Keyword Searching: Weighted Federated Search with Key Word in Context
Date: 10/2/2008
Dan McCreary
President
Dan McCreary & Associates
[email protected]
(952) 931-9198
Copyright 2008 Dan McCreary & Associates 2
Acknowledgements
• Joe Wicentowski wrote the original keyword search examples
• Joe’s work was based on the KWIC code done by Wolfgang Meier
Note About Example and Functions
• In an actual production system the code would be modularized into a series of functions
• This example has the functions intentionally removed to make the process easier to view
• A functionalized version will also be available for students to use in their production applications
Motivation
• You have a large, complex web site with many heterogeneous data collections: people, blogs, news stories, event calendars, etc.
• You want a single search function that will find any item in any of these collections
• Each item has a different:
  – Collection
  – Title
  – Item viewer function
Heterogeneous Items in a Collection
• Search results come back as heterogeneous items in a sequence
• Each hit item has a different structure
• Each hit item has a document type, and the title is consistently at the same XPath expression for each item type
[Figure: a sequence of hit items (person, blog, blog, country, person, blog), each containing a title element]
Detailed Steps
• Gather search keywords
• Construct scope (collections)
• Execute query (generate hits)
• Score and sort
• Prepare summary results for top hits
• Display top results
Basic Search Algorithm
let $q := get-parameter("q", "")
for $hit in $collection-list/type[contains(., $q)]
return $hit
(pseudo-code)
1. Get the search query
2. Find the documents that match; the [ ] predicate is like the SQL where clause
3. Return a short summary of the matching documents
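The three steps above can be tried without a database. The following is a minimal sketch that uses only standard XQuery functions; the inline $items data and the fixed $q value are hypothetical stand-ins for real collections and for the q request parameter.

```xquery
xquery version "1.0";

(: Hypothetical inline data standing in for two database collections. :)
declare variable $items :=
  <items>
    <article><title>XML Databases</title><body>Native XML databases store documents.</body></article>
    <person><name>Ann Lee</name><biography>Ann writes about metadata registries.</biography></person>
    <article><title>Search</title><body>Keyword search over XML documents.</body></article>
  </items>;

(: Assumption: in the real application this comes from the request's q parameter. :)
declare variable $q := "documents";

(: The predicate keeps only the nodes whose text contains the keyword, ignoring case. :)
for $hit in ($items/article/body, $items/person/biography)
              [contains(lower-case(.), lower-case($q))]
return <hit type="{local-name($hit/..)}">{string($hit)}</hit>
```

With this data the query returns the two article bodies. A plain contains() test is a substring match, so it is only a stand-in for a full-text index operator.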
Collection Paths and Predicate
for $hit in (collection('/db/test/articles')/article/body,
collection('/db/test/people')/person/biography) [. &= $q]
In a production system the list of collections would be stored in an XML file, and a function would return a sequence of the collections.
Sample HTML Search Form
<html>
  <head>
    <title>Keyword Search</title>
  </head>
  <body>
    <h1>Keyword Search</h1>
    <form method="GET" action="search.xq">
      <p>
        <strong>Keyword Search:</strong>
        <input name="q" type="text"/>
      </p>
      <p>
        <input type="submit" value="Search"/>
      </p>
    </form>
  </body>
</html>
The action attribute is the path to the XQuery REST service that your form uses.
Protection against injection attacks
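let $q := xs:string(request:get-parameter("q", ""))
let $filtered-q :=
    replace($q,
        "[&amp;""\-*;`~!@#\$%\^()_+=\[\]\{\}\|':/.,?]",
        "")
This removes any special characters from the input query that could be used in an injection attack, similar to a SQL injection attack. Inside the XQuery string literal, & is written as &amp; and a double quote is doubled; inside the character class, the regular-expression metacharacters are escaped with a backslash.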
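An alternative to listing every dangerous character is to keep only the characters you know are safe. The sketch below is one possible variant, not the method used in the slides: local:sanitize is a hypothetical helper that keeps letters, digits and whitespace, and drops everything else.

```xquery
xquery version "1.0";

(: Hypothetical helper: keep only letters, digits and whitespace,
   then collapse runs of whitespace into single spaces. :)
declare function local:sanitize($q as xs:string) as xs:string {
  normalize-space(replace($q, "[^\p{L}\p{N}\s]", ""))
};

(: The quote, semicolon and comment delimiters are removed,
   so this returns "metadata evil". :)
local:sanitize("metadata'; (: evil :)")
```

A whitelist of this kind does not need updating when a new special character turns out to matter, at the cost of also dropping harmless punctuation such as hyphens.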
Create a Scope Sequence
let $scope := (collection('/db/test/articles')/article/body, collection('/db/test/people')/people/person/biography)
A scope is the list of all the items that you will query against.
Note that in the future we will usually replace this "inline" scope variable with a function, xrx:get-searchable-collections(), that returns all of the searchable collections.
Scoring Each Hit
let $keyword-matches := text:match-count($hit)
let $hit-node-length := string-length($hit)
let $score := $keyword-matches div $hit-node-length
text:match-count() returns the number of keyword matches within a hit. If a hit has five occurrences of the keywords, the match count is 5.
Once you have the sequence of hits, you can score each hit and return a new sequence of the top-scoring hits.
In the example above the score is the number of matches within the hit divided by the length of the hit (in this case the total number of characters in the hit node), so a short node with a few matches can outrank a long node with the same number of matches.
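To see how dividing by length affects the score, here is a small sketch that uses only standard XQuery functions. local:match-count and local:score are hypothetical helpers; the first is a simple whole-word counter standing in for text:match-count(), not a reimplementation of it.

```xquery
xquery version "1.0";

(: Hypothetical helper: counts whole-word, case-insensitive
   occurrences of a single keyword in a string. :)
declare function local:match-count($text as xs:string, $keyword as xs:string) as xs:integer {
  count(tokenize(lower-case($text), "\W+")[. = lower-case($keyword)])
};

(: Score = matches divided by the length of the text in characters. :)
declare function local:score($text as xs:string, $keyword as xs:string) as xs:decimal {
  let $length := string-length($text)
  return if ($length eq 0) then 0.0
         else local:match-count($text, $keyword) div $length
};

let $short := "XML search with XML"
let $long := "XML search with XML, plus a long passage that never mentions the keyword again"
return (local:score($short, "xml"), local:score($long, "xml"))
```

Both strings contain the keyword twice, but the first score is higher because the shorter text has the greater keyword density.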
Score and Sort
let $sorted-hits :=
    for $hit in $hits
    let $keyword-matches := text:match-count($hit)
    let $hit-node-length := string-length($hit)
    let $score := $keyword-matches div $hit-node-length
    order by $score descending
    return $hit
Once you have the sequence of hits, you can score each of the hits and return a list of the hits sorted with the top-scoring hits first.
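The same score-and-sort pattern can be run on inline data. In this sketch the three body elements are hypothetical hits, and a whole-word count of the single keyword "search" stands in for text:match-count().

```xquery
xquery version "1.0";

(: Hypothetical hits; in the application these come from the query. :)
declare variable $hits := (
  <body id="a">search</body>,
  <body id="b">search search and search again</body>,
  <body id="c">a long body that mentions search only once in many words</body>
);

for $hit in $hits
(: Stand-in for text:match-count(): whole-word matches of "search". :)
let $matches := count(tokenize(lower-case($hit), "\W+")[. = "search"])
let $score := $matches div string-length($hit)
order by $score descending
return <result id="{$hit/@id}" score="{round-half-to-even($score, 3)}"/>
```

The results come back in the order a, b, c. Hit a outranks hit b even though b has three matches, which shows how strongly this formula favors short nodes; a production system might weight the score differently.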
Result Pagination
let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $end := $start + $perpage
let $results := for $hit in $sorted-hits[position() = ($start + 1) to $end]
The remainder of our example deals with iterating through the results N records at a time, where N is the number of results per page ($perpage).
In this case $perpage and $start are both optional parameters to our search query.
$end is the sum of the start and the number per page.
Because $start is a zero-based offset and XQuery sequences are numbered from 1, the first hit on the page is at position $start + 1 and the last is at position $end.
Adding the [position() = ($start + 1) to $end] predicate is the same as calling subsequence($sorted-hits, $start + 1, $perpage) on the sorted hits to get the final $results sequence to display on the page.
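A quick way to check the page arithmetic is to paginate a sequence of integers. In this sketch the three variables are hypothetical fixed values standing in for the sorted hits and the request parameters.

```xquery
xquery version "1.0";

(: Hypothetical stand-ins: 25 "hits", 10 per page, zero-based offset 10. :)
declare variable $sorted-hits := 1 to 25;
declare variable $perpage := 10;
declare variable $start := 10;

(: subsequence() takes a 1-based starting position and a length,
   so this returns items 11 through 20. :)
subsequence($sorted-hits, $start + 1, $perpage)
```

subsequence() also handles the last page gracefully: with $start set to 20 it returns only the five remaining items.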
Showing Results
• With Highlighted Keyword in Context
• We want to show each result as an HTML div element containing three components:
  – The document title
  – A summary with an excerpt of the hit showing the keywords highlighted in context
  – A link to display the full document
Extracting the Collection and Document
let $collection := util:collection-name($hit)
let $document := util:document-name($hit)
We do not need to keep track of the original collection and document that each hit came from, because we can always find them using these two functions.
KWIC Functions
• let $summary := kwic:summarize($hit, $config)
Displaying the Keyword in Context
The word or words you used in your search should be highlighted in the context of the search results. You can customize how much of the surrounding text you want to display.
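To illustrate the idea of keyword in context, here is a minimal sketch in standard XQuery. It is not the kwic module's implementation: local:kwic is a hypothetical function that highlights only the first match, treats the keyword as a plain substring, and assumes that lower-casing does not change the length of the text.

```xquery
xquery version "1.0";

(: Hypothetical helper: returns up to $width characters on each side of
   the first case-insensitive match, with the match wrapped in a span. :)
declare function local:kwic($text as xs:string, $keyword as xs:string,
                            $width as xs:integer) as element(p)? {
  let $lower := lower-case($text)
  let $key := lower-case($keyword)
  return
    if (not(contains($lower, $key))) then ()
    else
      (: 1-based position of the first match :)
      let $pos := string-length(substring-before($lower, $key)) + 1
      let $before := substring($text, max((1, $pos - $width)), min(($width, $pos - 1)))
      let $match := substring($text, $pos, string-length($keyword))
      let $after := substring($text, $pos + string-length($keyword), $width)
      return <p>{$before}<span class="hi">{$match}</span>{$after}</p>
};

local:kwic("The quick brown fox", "brown", 5)
```

The $width argument plays the role of the "how much surrounding text" setting: a larger value shows more context around the highlighted keyword.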
Calculating number of pages
let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $total-result-count := count($hits)
(: the last item shown is never past the last hit, on any page :)
let $end := min(($start + $perpage, $total-result-count))
let $number-of-pages := xs:integer(ceiling($total-result-count div $perpage))
let $current-page := xs:integer(($start + $perpage) div $perpage)
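A worked example makes the page arithmetic easier to check. The values below are hypothetical: 25 hits, 10 per page, and a zero-based $start of 20, which is the third and last page. The current page is computed with integer division, which agrees with the formula above whenever $start is a multiple of $perpage.

```xquery
xquery version "1.0";

(: Hypothetical values standing in for count($hits) and the request parameters. :)
declare variable $total-result-count := 25;
declare variable $perpage := 10;
declare variable $start := 20;

let $end := min(($start + $perpage, $total-result-count))
let $number-of-pages := xs:integer(ceiling($total-result-count div $perpage))
let $current-page := ($start idiv $perpage) + 1
(: Yields <paging end="25" pages="3" current="3"/> :)
return <paging end="{$end}" pages="{$number-of-pages}" current="{$current-page}"/>
```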
Managing Federated Search
• Each application you use needs to communicate the following items to the federated search tool:
  – Collection name
  – Collection data path
  – Collection document path
  – Collection title path
  – Collection id path
  – Collection viewer path
Sample App Config File
<app-info>
  <app-name>Articles</app-name>
  <app-path>/db/test/articles</app-path>
  <doc-path>article/body</doc-path>
  <doc-title-path>article/title/text()</doc-title-path>
  <doc-id>article/id/text()</doc-id>
  <doc-viewer>/db/test/articles/views/view-article.xq?id=</doc-viewer>
</app-info>
If you create a file called app-info.xml in each collection that you want to search, you can dynamically create a list of the applications that you want to search. If you do this you can automate the installation of interoperable applications.
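As one illustration of how a search tool might use these files, the sketch below looks up the viewer path for a hit's collection and appends the document id to build a link. The inline $apps sequence, the People entry, and the local:viewer-url function are all hypothetical; a real system would load each app-info.xml from the database instead.

```xquery
xquery version "1.0";

(: Hypothetical inline copies of two app-info.xml files. :)
declare variable $apps := (
  <app-info>
    <app-name>Articles</app-name>
    <app-path>/db/test/articles</app-path>
    <doc-viewer>/db/test/articles/views/view-article.xq?id=</doc-viewer>
  </app-info>,
  <app-info>
    <app-name>People</app-name>
    <app-path>/db/test/people</app-path>
    <doc-viewer>/db/test/people/views/view-person.xq?id=</doc-viewer>
  </app-info>
);

(: Hypothetical helper: finds the app whose app-path matches the hit's
   collection and appends the id to its viewer path. :)
declare function local:viewer-url($collection as xs:string, $id as xs:string) as xs:string? {
  let $app := $apps[app-path = $collection][1]
  return if (empty($app)) then () else concat($app/doc-viewer, $id)
};

(: Returns "/db/test/people/views/view-person.xq?id=42" :)
local:viewer-url("/db/test/people", "42")
```

Because the viewer path lives in the configuration, adding a new searchable application does not require changing the search code.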
Thank You!
Please contact me for more information:
• Native XML Databases
• Metadata Management
• Metadata Registries
• Service Oriented Architectures
• Business Intelligence and Data Warehouse
• Semantic Web
Dan McCreary, President
Dan McCreary & Associates
Metadata Strategy [email protected]
(952) 931-9198