Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval...
-
Upload
marvin-parsons -
Category
Documents
-
view
227 -
download
1
Transcript of Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval...
Chapter 4Query Languages
..
.
Introduction Cover different kinds of queries
posed to text retrieval systems Keyword-based query languages
include simple words and phrases as well as Boolean operators
Pattern matching complement keyword searching with
data retrieval capabilities Structural queries
querying on structure of text
Keyword-Based Querying Query is formulation of user information
need Keyword-based queries are popular
intuitive easy to express allow for fast ranking
Query can be simply a word in general more complex combination of
operations involving several words
Single-Word Queries Most elementary query is a word Word is sequence of letters surrounded by
separators some characters are not letters but do not split
a word, e.g. hyphen in on-line Result of word queries is
set of documents containing at least one of the words in query
resulting documents are ranked term frequency (count of word in document) inverse document frequency (count of no. of
documents in which word appears)
Context Queries Complement single-word queries
with ability to search words in given context, I.e. near other words
words near each other signal higher likelihood of relevance than if they appear apart
form phrases of words or find words which are proximal in text
Phrase Sequence of single-word queries for instance, possible to search for
word ‘enhance’ and then word ‘retrieval’
uninteresting words in text are not considered at all e.g. above example query could match
text such as ‘…enhance the retrieval…’
Proximity Sequence of single words or phrases
is given together with maximum allowed distance between them
For instance, above query stated as ‘enhance’ and ‘retrieval’ should occur
within four words a possible match could be ‘… enhance the
power of retrieval…’ Distance can be measured in characters
or words
Boolean Queries Oldest form of combining keyword
queries is to use Boolean operators Boolean query has following syntax
atoms (I.e. basic queries) that retrieve documents, and of
Boolean operators which work on their operands (sets of documents)
query syntax tree can be defined leaves are basic queries internal nodes are operators
Boolean Queries (Cont.)
AND
translation OR
Retrieve all documents that contain the word ‘translation’ as well as either the word ‘syntax’ or the word ‘syntactic’
syntax syntactic
Boolean Queries (Cont.) No ranking of retrieved documents
provided document either satisfies query
(retrieved) or does not (not retrieved) does not allow partial matching between
document and user query to overcome this limitation, idea of ‘fuzzy
Boolean’ set of operators proposed instead of all the operands (AND) or at
least in one of operands (OR), retrieve elements in some operands
Natural Language Distinction between AND and OR
completely blurred simply an enumeration of words and
context queries all documents matching portion of user
query are retrieved higher ranking assigned to documents
matching more parts of query eliminated any reference to Boolean
operators
Pattern Matching Query formulation based on concept of pattern that allow retrieval of pieces of text that have some property
Pattern is set of syntactic features that occur in text segment
Segments satisfying pattern specification said to ‘match’ the pattern
We are interested in documents containing segments that match given search pattern
Pattern Matching (Cont.) Most used types of pattern are
words string (sequence of characters) that is a word
in text
prefixes string that form beginning of text word prefix ‘comput’ retrieve documents with words
such as ‘computer’, ‘computation’
suffixes string that form termination of word suffix ‘ters’ retrieve documents with words
such as ‘testers’, ‘computers’
substrings string which can appear within word substring ‘tal’ retrieve documents with words
such as ‘coastal’, ‘talk’, ‘metallic’
ranges pair of strings that match any word lying
between them in lexicographical order alphabets sorted to order string into
lexicographical order (dictionary order) range between words ‘held’ and ‘hold’ retrieve
strings such as ‘hoax’, ‘hissing’
allowing errors word together with error threshold retrieves all text words ‘similar’ to given word pattern may have errors (typing, spelling) and
documents with words with erroneous variants are retrieved (with edit distance)
if typing error splits ‘flower’ into ‘flo wer’, still found with one error
regular expression (r.e.) r.e. is built up by simple strings and operators
like union, concatenation and repetition query like ‘pro (plem | tein) (s | ) (0 | 1 | 2)*’
will match words like ‘problem02’, ‘proteins’
Extended patterns subset of regular expressions expressed with
simpler syntax classes of characters, I.e. some position in
pattern matched by any character from pre-defined set (e.g. some characters must be digit, not a letter, vowel, etc.)
conditional expressions, I.e. part of pattern may or may not appear
wild characters, I.e. match any sequence in text (e.g. any word starts as ‘flo’ and ends with ‘ers’ which match ‘flowers’ as well as ‘flounders’
Structural Queries Allowing user to query texts based on
structure, and not content mixing contents and structure in queries
can pose powerful queries (much more expressive)
An example select set of documents that satisfy certain
constraints on content (using word, phrase, or patterns) and then
structural constraints expressed using containment, proximity, or chapters, sections present in documents
Types of structures
fixed structure
hypertext
hierarchical structure
Fixed Structure Document has fixed set of fields each field has some text inside some fields not present in all documents fields not allowed to nest or overlap retrieval activity restricted to specifying
that given pattern was to be found only in given fields
this model reasonable when text collection has fixed structure
Hypertext Retrieval from hypertext began as
navigational activity user manually traverse hypertext nodes
following links to search what he wanted not possible to query hypertext based on
its structure WebGlimpse - interesting proposal to
allow navigation plus ability to search by content in neighborhood of current node
Hierarchical Structure Represent recursive decomposition of
text natural model for many text collections Figure 4.3 shows example of hierarchical
structure that consists of page of a book, its schematic view and parsed query to retrieve figure
Hierarchical ModelsPAT Expressions Structure marked in the text by tags (as
in HTML) defined in terms of initial and final tags
each pair of initial and final tags defines a region, set of contiguous text areas area of region cannot nest or overlap
possible to select areas containing other areas, contained in other areas, or followed by other areas
Overlapped Lists Allows area of regions to overlap, but not
to nest considers use of inverted list where
words and also regions are indexed allows to perform set union, and to
combine regions
List of References Attempt to make definition and querying
of structured text uniform, using common language
the language allows for querying on ‘path expressions’, which describe paths in structure tree
answers to queries are list of ‘references’ reference is pointer to region of database
Proximal Nodes Tries to find good compromise between
expressiveness and efficiency specifies fully compositional language
where leaves of query syntax tree formed by basic queries on contents or names of structural elements (e.g. all chapters)
internal nodes combine results for efficiency, operations at internal
nodes must relate nodes close in text
Tree Matching Relies on single primitive: tree inclusion interpret structure of text database and
query as trees determine embedding of query into
database which respects hierarchical relationships between nodes of query
simple queries return roots of the matches