An Introduction computer scientist -...
Transcript of An Introduction computer scientist -...
![Page 1: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/1.jpg)
Information Retrieval
– An Introduction –
– The view of an open-minded –– computer scientist –
www.studentsfocus.com
![Page 2: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/2.jpg)
What is Information Retrieval ?
• The process of actively seeking out information relevant to a topic of interest (van Rijsbergen)
– Typically it refers to the automatic (rather than manual) retrieval of documents• Information Retrieval System (IRS)
– “Document” is the generic term for an information holder (book, chapter, article, webpage, etc)
www.studentsfocus.com
![Page 3: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/3.jpg)
IR in practice• Information Retrieval is a research-driven theoretical
and experimental discipline– The focus is on different aspects of the information–seeking
process, depending on the researcher’s background or interest:• Computer scientist – fast and accurate search engine• Librarian – organization and indexing of information• Cognitive scientist – the process in the searcher’s mind• Philosopher – Is this really relevant ?• …
– Progress influenced by advances in Computational Linguistics, Information Visualization, Cognitive Psychology, HCI, …
• Experimental vs. operational systems• Analogy to car manufacturingwww.studentsfocus.com
![Page 4: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/4.jpg)
Fundamental concepts in IR
• What is information ?
• Meaning vs. form
• Data vs. Information Retrieval
• Relevance
www.studentsfocus.com
![Page 5: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/5.jpg)
Disclaimer• Relevance and other key concepts in IR were
discussed in the previous class, so we won’t do it again. – We’ll take a simple view: a document is relevant if it is
about the searcher’s topic of interest
• We will discuss text documents, not other media– Most current tools that search for images, video, or
other media rely on text annotations– Real content retrieval of other media (based on shape,
color, texture, …) are not mature yet
www.studentsfocus.com
![Page 6: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/6.jpg)
The stages of IR
Indexedand structuredinformation
Information
Retrieval• Searching• Browsing
Indexing,organizing
Creation
www.studentsfocus.com
![Page 7: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/7.jpg)
The formalized IR process
Collection of documents
Real world
Document representations Query
Information need
Anomalous state of knowledge
Matching
Resultswww.studentsfocus.com
![Page 8: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/8.jpg)
What do we want from an IRS ?
• Systemic approach– Goal (for a known information need):
Return as many relevant documents as possible and as few non-relevant documents as possible
• Cognitive approach– Goal (in an interactive information-seeking
environment, with a given IRS):Support the user’s exploration of the problem
domain and the task completion.www.studentsfocus.com
![Page 9: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/9.jpg)
The role of an IR system– a modern view –
• Support the user in– exploring a problem domain, understanding its
terminology, concepts and structure– clarifying, refining and formulating an information
need– finding documents that match the info need
description• As many relevant docs as possible• As few non-relevant documents as possible
www.studentsfocus.com
![Page 10: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/10.jpg)
How does it do this ?
• User interfaces and visualization tools for– exploring a collection of documents– exploring search results
• Query expansion based on– Thesauri– Lexical/statistic analysis of text / context and concept
formation– Relevance feedback
• Indexing and matching modelwww.studentsfocus.com
![Page 11: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/11.jpg)
How well does it do this ?• Evaluation
– Of the components• Indexing / matching algorithms
– Of the exploratory process overall• Usability issues• Usefulness to task• User satisfaction
www.studentsfocus.com
![Page 12: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/12.jpg)
Role of the user interface in IR
Problem definition
Source selection
Problem articulation
Examination of results
Extraction of information
Integration with overall task
INPUT
OUTPUT
Engine
www.studentsfocus.com
![Page 13: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/13.jpg)
Information Visualization toolsfor exploration
• Rely on some form of information organization
• Principle:– Overview first– Zoom– Details on demand
• Usability issues– Direct manipulation– Dynamic, implicit querieswww.studentsfocus.com
![Page 14: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/14.jpg)
Information Visualization tools
• Repositories– University of Maryland HCIL
• http://www.cs.umd.edu/projects/hcil
– InfoViz repository• http://fabdo.fh-potsdam.de/infoviz/repository.html
• Hyperbolic trees• Themescapes• Workscapes• Fisheye viewwww.studentsfocus.com
![Page 15: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/15.jpg)
Faceted organization• Each document is described by a set of attribute
(or facet) values• Example:
– FilmFinder, HomeFinder– Film
• Attributes (facets): Title, Year, Popularity, Director, Actors
• In design terms, it refers to composition.
www.studentsfocus.com
![Page 16: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/16.jpg)
Hierarchic organization
Computing
Computer
Screen Keyboard C++Pascal
Programming language
...Mathematics
...
Algebra
Computing, Mathematics Physics
Science
Role of structure:• support for exploration (browsing / searching)• support for term disambiguation• potential for efficient retrieval
In design terms it refers to inheritance.www.studentsfocus.com
![Page 17: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/17.jpg)
Structuring a document collection
• Manual, by experts - slow, expensive, infeasible for large corpora
• Supervised categorization• Classes or hierarchic structure established by human
experts• Documents automatically allocated to classes
• Unsupervised classification = clustering• Similar documents grouped together, and a structure is
expected to emerge• Result influenced by the homogeneity/heterogeneity of
the documents, by the indexing and clustering methods and parameters
www.studentsfocus.com
![Page 18: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/18.jpg)
www.studentsfocus.com
![Page 19: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/19.jpg)
Document Clustering
• Finds overall similarities among groups of documents
• Finds overall similarities among groups of documents
• Picks out some themes, ignores otherswww.studentsfocus.com
![Page 20: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/20.jpg)
The Cluster Hypothesis
• “Similar documents tend to be relevant to the same requests”
• Issues:– Variants: “Documents that are relevant to the
same topics are similar”– Simple vs. complex topics– Evaluation, prediction
• The cluster hypothesis is the main motivation behind document clustering
www.studentsfocus.com
![Page 21: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/21.jpg)
Document-document similarity
• Document representative– Select features to characterize document: terms,
phrases, citations– Select weighting scheme for these features:
• Binary, raw/relative frequency, divergence measure• Title / body / abstract, controlled vocabulary, selected topics,
taxonomy
• Similarity / association coefficient or dissimilarity / distance metricwww.studentsfocus.com
![Page 22: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/22.jpg)
Similarity coefficients
• Simple matching
• Dice’s coefficient
• Cosine coefficient
¦i
iiyx
¦ ¦¦�
�
i iii
iii
yxyx
22
2
¦ ¦¦
�
�
i iii
iii
yx
yx22YX
YX��
YXYX
���2
YX �
www.studentsfocus.com
![Page 23: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/23.jpg)
Clustering methods
• Non-hierarchic methods=> partitions– High efficiency, low effectiveness
• Hierarchic methods=> hierarchic structures - small clusters of highly
similar documents nested within larger clusters of less similar documents
– Divisive => monothetic classifications– Agglomerative => polythetic classifications !!
www.studentsfocus.com
![Page 24: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/24.jpg)
Partitioning method
• Generic procedure:– The first object becomes the first cluster– Each subsequent object is matched against
existing clusters• It is assigned to the most similar cluster if the similarity
measure is above a set threshold• Otherwise it forms a new cluster
– Re-shuffling of documents into clusters can be done iteratively to increase cluster similarity
www.studentsfocus.com
![Page 25: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/25.jpg)
HACM’s• Generic procedure:
– Each doc to be clustered is a singleton cluster– While there is more than one cluster, the clusters with
maximum similarity are merged and the similarities recomputed
• A method is defined by the similarity measure between non-singleton clusters
• Algorithms for each method differ in:– Space (store similarity matrix ? all of it ?)– Time (use all similarities ? use inverted files ?)
www.studentsfocus.com
![Page 26: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/26.jpg)
Representation of clustered hierarchies
www.studentsfocus.com
![Page 27: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/27.jpg)
Scatter/Gather• How it works
– Cluster sets of documents into general “themes”, like a table of contents
– Display the contents of the clusters by showing topical terms and typical titles
– User chooses subsets of the clusters and re-clusters the documents within
– Resulting new groups have different “themes”
• Originally used to give collection overview
• Evidence suggests more appropriate for displaying retrieval results in contextwww.studentsfocus.com
![Page 28: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/28.jpg)
www.studentsfocus.com
![Page 29: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/29.jpg)
Multi-Dimensional Metaphor for the Document Space
www.studentsfocus.com
![Page 30: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/30.jpg)
www.studentsfocus.com
![Page 31: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/31.jpg)
Search strategies• Analytical strategy (mostly querying)
– Analyze the attributes of the information need and of the problem domain (mental model)
• Browsing– Follow leads by association (not much planning)
• Known site strategy– Based on previous searches– Indexes or starting points for browsing
• Similarity strategy– “more like this”www.studentsfocus.com
![Page 32: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/32.jpg)
Non-search activities
• Reading and interpreting
• Annotating or summarizing
• Analysis– Finding trends– Making comparisons– Aggregating information– Identifying a critical subsetwww.studentsfocus.com
![Page 33: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/33.jpg)
IRS design trade-offs(high-level)
• General– Easy to learn (“walk up and use”)
• Intuitive• Standardized look-and-feel and functionality
– Simple and easy to use– Deterministic and restrictive
• Specialized– Complex, require training (course, tutorial)– Increased functionality– Customizable, non-deterministicwww.studentsfocus.com
![Page 34: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/34.jpg)
Query specification• Boolean vs. free text
• Structure analysis vs. bag of words
• Phrases / proximity
• Faceted / weighted queries (TileBars, FilmFinder)
• Graphical support (Venn diagrams, filters)
• Support for query formulation (aid-word list, thesauri, spell-checking)www.studentsfocus.com
![Page 35: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/35.jpg)
Query Specification
• Interaction Styles – Command Language– Form Fillin– Menu Selection– Direct Manipulation– Natural Language
• Example:– How do each apply to Boolean Querieswww.studentsfocus.com
![Page 36: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/36.jpg)
Form-Based Query Specification (Altavista)
www.studentsfocus.com
![Page 37: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/37.jpg)
Form-based Query Specification (Infoseek)
www.studentsfocus.com
![Page 38: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/38.jpg)
www.studentsfocus.com
![Page 39: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/39.jpg)
Menu-based Query Specification(Young & Shneiderman 93)
www.studentsfocus.com
![Page 40: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/40.jpg)
Putting Results in Context
• Interfaces should – give hints about the roles terms play in the collection– give hints about what will happen if various terms are
combined– show explicitly why documents are retrieved in
response to the query– summarize compactly the subset of interest
www.studentsfocus.com
![Page 41: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/41.jpg)
KWIC (Keyword in Context)• An old standard, ignored by internet search engines
– used in some intranet engines, e.g., Cha-Cha
www.studentsfocus.com
![Page 42: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/42.jpg)
TileBars
www.studentsfocus.com
![Page 43: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/43.jpg)
The formalized IR process
Collection of documents
Real world
Document representations Query
Information need
Anomalous state of knowledge
Matching
Resultswww.studentsfocus.com
![Page 44: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/44.jpg)
Indexing
• Association of descriptors (keywords, concepts, metadata) to documents in view of future retrieval
• The knowledge / expectation / behavior of the searcher needs to be anticipated
www.studentsfocus.com
![Page 45: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/45.jpg)
Manual and automatic indexing• Manual
– Human indexers assign index terms to documents– A computer system may be used to record the descriptors
generated by the human
• Automatic– The system extracts “typical”/ “significant” terms– The human may contribute by setting the parameters or
thresholds, or by choosing components or algorithms
• Semi-automatic– The system’s contribution may be support in terms of word lists,
thesauri, reference system, etc, following or not the automatic processing of the textwww.studentsfocus.com
![Page 46: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/46.jpg)
Manual vs. automatic indexing• Manual
– Slow and expensive– Is based on intellectual judgment and semantic
interpretation (concepts, themes)– Low consistency
• Automatic– Fast and inexpensive– Mechanical execution of algorithms, with no intelligent
interpretation (aboutness / relevance)– Consistentwww.studentsfocus.com
![Page 47: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/47.jpg)
Vocabulary
• Vocabulary (indexing language)– The set of concepts (terms or phrases) that can be used to
index documents in a collection
• Controlled– Specific for specialized domains– Potential for increased consistency of indexing and precision
of retrieval
• Un-controlled (free)– Potentially all the terms in the documents– Potential for increased recall
www.studentsfocus.com
![Page 48: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/48.jpg)
Thesauri• Capture relationships between indexing terms
– Hierarchical– Synonymous– Related
• Creation of thesauri– Manual vs. automatic
• Use of thesauri– In manual / semi-automatic / automatic fashion– Syntagmatic co-ordination / thesaurus-based query
expansion during indexing / searching
www.studentsfocus.com
![Page 49: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/49.jpg)
Query indexing
• Search systems– Automatic indexing– Synchronization with indexing of documents
(vocabulary, algorithms, etc)
• Interactive / browsing systems– Support tools (word list, thesauri)– Query not necessarily explicit
www.studentsfocus.com
![Page 50: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/50.jpg)
Automatic indexing• Rationalist approach
– Natural Language processing / Artificial Intelligence– Attempts to define and use grammatical, knowledge
and reasoning rules (Chomsky)– More computationally intensive
• Empiricist approach– Statistical Language Processing– Estimate probabilities of linguistic events: words,
phrases, sentences (Shannon)– Inexpensive, but just as good
www.studentsfocus.com
![Page 51: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/51.jpg)
Automatic indexing• There is no “best solution”
• An “engineering” approach is taken: creatively combine theoretical models and techniques, test, make adjustments until the results are satisfying
• Balance between effort/sophistication of method and quality of results needed
• Results depend on the specific document collection and on the type of applicationwww.studentsfocus.com
![Page 52: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/52.jpg)
TEXT
REPRESENTATION
Lexical analysis
Stemming
Stopword removal
representation
Steps of automatic indexing
Collection/document structure
Data structure
www.studentsfocus.com
![Page 53: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/53.jpg)
wordfrequency
too frequent: useless discriminators
significant terms
too rare: no significant contribution tothe content of the document
Term significance
Word occurrence frequency is a measure for the significance of terms and their discriminatory power (see Brown corpus).
www.studentsfocus.com
![Page 54: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/54.jpg)
Weighting• Heuristics
– Based on common sense, but adjusted/engineered following experiments. Ex:
– Terms that occur in only a few documents are often more valuable than ones that occur in many - IDF
– The more often a term occurs in a document, the more likely it is to be important for that document -TF
– A term that occurs the same number of times in short document and in a long document is likely to be more valuable for the former - DL
www.studentsfocus.com
![Page 55: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/55.jpg)
Document weighting• Theoretical models
– Provide theoretical justification of the formulae– Take advantage of mathematical theory– Are typically adjusted by heuristics
• Probabilistic– Rank documents based on the estimated probability that they
are relevant to the query (derived from term counts)
• Language models– Rank documents based on the estimated probability that the
query is a random sample of document wordswww.studentsfocus.com
![Page 56: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/56.jpg)
Ranked retrieval
• The documents are ranked based on their score
• Advantages– Query easy to specify– The output is ranked based on the estimated
relevance of the documents to the query– A wide variety of theoretical models exist
• Disadvantages– Query less precise (although weighting can be used)
www.studentsfocus.com
![Page 57: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/57.jpg)
Boolean retrieval
• Documents are retrieved based on their containing or not query terms
• Advantages– Very precise queries can be specified– Very easy to implement (in the simple form)
• Disadvantages– Specifying the query may be difficult for casual users– Lack of control over the size of the retrieved setwww.studentsfocus.com
![Page 58: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/58.jpg)
IR Evaluation
• Why evaluate ?– “Quality”
• What to evaluate ?– Qualitative vs. quantitative measures
• How to evaluate ?– Experimental design; result analysis
• Complex and controversial topicwww.studentsfocus.com
![Page 59: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/59.jpg)
Actors involved
• Funders– Cost to implement, estimated savings– User satisfaction, public recognition
• Librarian, library scientist– Functionality– Support for search strategies– User satisfaction
www.studentsfocus.com
![Page 60: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/60.jpg)
Actors involved
• Information scientist, mathematician– Underlying mathematical model for representing
information– Weighting scheme, document-query matching– System effectiveness
• Computer scientist, software developer– System efficiency (speed, resources needed)– Flexibility, extensibility
www.studentsfocus.com
![Page 61: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/61.jpg)
Need to evaluate
• Technology hype or real need ?– Landauer, T. – “The trouble with computers”
• Does it justify its cost ?– Pros: review and improvement of procedures and
workflow; increased efficiency; increased control and safety
– Cons: actual cost of system; work interruption; need for re-training
– Quality – can it be improved ?www.studentsfocus.com
![Page 62: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/62.jpg)
History• Systemic approach
– User outside the system– Static/fixed information need– Retrieval effectiveness measured– Batch retrieval simulations
• User-centered approach– User part of the system, interacting with other
components, trying to resolve an anomalous state of knowledge
– Task-oriented evaluationwww.studentsfocus.com
![Page 63: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/63.jpg)
Aspects to evaluate
Problem definition
Source selection
Problem articulation
Examination of results
Extraction of information
Integration with overall task
INPUT
OUTPUT
Engine
www.studentsfocus.com
![Page 64: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/64.jpg)
Experimental design decisions
• Whole vs. parts
• Black box vs. diagnostic systems
• Operational vs. experimental system
www.studentsfocus.com
![Page 65: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/65.jpg)
One possible approach• IR-specific evaluation
– Systemic• Quality of search engine• Influence of various modelling decisions (stopword removal,
stemming, indexing, weighting scheme, …)– Interaction
• Support for query formulation• Support for exploration of search output
• Non-specific evaluation– Task-oriented evaluation
• Usefulness, usability• Task completion, user satisfaction
www.studentsfocus.com
![Page 66: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/66.jpg)
Laboratory vs. operational settings
• Laboratory– Typically only one or several components of the system are
evaluated– Assumptions are made about the other components– User behavior is typically simulated (software)– Control over experimental variables, repeatability, observability
• Operational – More or less “real” users– Real of inferred information needs– Realism www.studentsfocus.com
![Page 67: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/67.jpg)
The traditional (lab) IR experiment
• To start with you need:– An IR system (or two)– A collection of documents– A collection of requests– Relevance judgements
• Then you run your experiment:– Input the documents– Put each request to the system– Collect the outputwww.studentsfocus.com
![Page 68: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/68.jpg)
Retrieval effectiveness
Relevant
Retrieved
All docs
www.studentsfocus.com
![Page 69: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/69.jpg)
Precision vs. Recall
Relevant
Retrieved
|Collectionin Rel||edRelRetriev| Recall
|Retrieved||edRelRetriev| Precision
All docs
www.studentsfocus.com
![Page 70: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/70.jpg)
Interactive system’s evaluation
• Definition:Evaluation = the process of systematicallycollecting data that informs us about what it is like for a particular user or group of users to use a product/system for a particular task in a certain type of environment.
www.studentsfocus.com
![Page 71: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/71.jpg)
Problems
• Attitudes:– Designers assume that if they and their colleagues
can use the system and find it attractive, others will too• Features vs. usability or security
– Executives want the product on the market yesterday• Problems “can” be addressed in versions 1.x
– Consumers accept low levels of usability• “I’m so silly”
www.studentsfocus.com
![Page 72: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/72.jpg)
Two main types of evaluation• Formative evaluation is done at different stages
of development to check that the product meets users’ needs.– Part of the user-centered design approach– Supports design decisions at various stages– May test parts of the system or alternative designs
• Summative evaluation assesses the quality of a finished product. – May test the usability or the output quality– May compare competing systems
www.studentsfocus.com
![Page 73: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/73.jpg)
What to evaluateIterative design & evaluation is a continuous
process that examines:
• Early ideas for conceptual model • Early prototypes of the new system• Later, more complete prototypes
Designers need to check that they understand users’ requirements and that the design assumptions hold.
www.studentsfocus.com
![Page 74: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/74.jpg)
Four evaluation paradigms• ‘quick and dirty’
• usability testing
• field studies
• predictive evaluation
www.studentsfocus.com
![Page 75: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/75.jpg)
Quick and dirty
• ‘quick & dirty’ evaluation describes the common practice in which designers informally get feedback from users or consultants to confirm that their ideas are in-line with users’ needs and are liked.
• Quick & dirty evaluations are done any time.• The emphasis is on fast input to the design
process rather than carefully documented findings.
www.studentsfocus.com
![Page 76: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/76.jpg)
Usability testing• Usability testing involves recording typical users’
performance on typical tasks in controlled settings. Field observations may also be used.
• As the users perform these tasks they are watched & recorded on video & their key presses are logged.
• This data is used to calculate performance times, identify errors & help explain why the users did what they did.
• User satisfaction questionnaires & interviews are used to elicit users’ opinions.
www.studentsfocus.com
![Page 77: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/77.jpg)
Usability testing
• It is very time consuming to conduct and analyze– Explain the system, do some training– Explain the task, do a mock task– Questionnaires before and after the test & after each
task– Pilot test is usually needed
• Insufficient number of subjects for ‘proper’ statistical analysis
• In laboratory conditions, subjects do not behave exactly like in a normal environment
www.studentsfocus.com
![Page 78: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/78.jpg)
Field studies
• Field studies are done in natural settings• The aim is to understand what users do
naturally and how technology impacts them.• In product design field studies can be used to:
- identify opportunities for new technology- determine design requirements - decide how best to introduce new technology- evaluate technology in use
www.studentsfocus.com
![Page 79: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/79.jpg)
Predictive evaluation
• Experts apply their knowledge of typical users, often guided by heuristics, to predict usability problems.
• Another approach involves theoretically based models.
• A key feature of predictive evaluation is that users need not be present
• Relatively quick & inexpensive
www.studentsfocus.com
![Page 80: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/80.jpg)
Overview of techniques
xObserving usersxDon’t interfere with the subjects !
xAsking users’ opinionsx Interviews, questionnaires
xAsking experts’ opinionsxHeuristics, role-playing; suggestions for
solutions
www.studentsfocus.com
![Page 81: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/81.jpg)
Overview of techniques
x Testing users’ performancex Time taken to complete a task, errors made,
navigation pathx Satisfaction
x Modeling users’ task performancex Appropriate for systems with limited functionalityx Make assumptions about the user’s typical,
optimal, or poor behaviourx Simulate the user and measure performance
www.studentsfocus.com
![Page 82: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/82.jpg)
Web Information Retrieval
ChallengesApproaches
www.studentsfocus.com
![Page 83: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/83.jpg)
Challenges• Scale, distribution of documents• Controversy over the unit of indexing
– What is a document ? (hypertext)– What does the use expect to be retrieved ?
• High heterogeneity– Document structure, size, quality, level of abstraction /
specialization– User search or domain expertise, expectations
• Retrieval strategies– What do people want ?
• Evaluationwww.studentsfocus.com
![Page 84: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/84.jpg)
Web documents / data• No traditional collection
– Huge• Time and space to crawl index• IRSs cannot store copies of documents
– Dynamic, volatile, anarchic, un-controlled– Homogeneous sub-collections
• Structure– In documents (un-/semi-/fully-structured)– Between docs: network of inter-connected nodes– Hyper-links - conceptual vs. physical documents
www.studentsfocus.com
![Page 85: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/85.jpg)
Web documents / data• Mark-up
– HTML – look & feel– XML – structure, semantics– Dublin Core Metadata– Can webpage authors be trusted to correctly mark-up
/ index their pages ?
• Multi-lingual documents
• Multi-media
www.studentsfocus.com
![Page 86: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/86.jpg)
Theoretical models forindexing / searching
• Content-based weighting– As in traditional IRS, but trying to incorporate
• hyperlinks• the dynamic nature of the Web (page validity, page caching)
• Link-based weighting– Quality of webpages
• Hubs & authorities• Bookmarked pages• Iterative estimation of quality
www.studentsfocus.com
![Page 87: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/87.jpg)
Architecture
• Centralized– Main server contains the index, built by an indexer,
searched by a query engine• Advantage: control, easy update• Disadvantage: system requirements (memory, disk,
safety/recovery)
• Distributed– Brokers & gatherers
• Advantage: flexibility, load balancing, redundancy• Disadvantage: software complexity, update
www.studentsfocus.com
![Page 88: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/88.jpg)
User variability
• Power and flexibility for expert users vs. intuitiveness and ease of use for novice users
• Multi-modal user interface– Distinguish between experts and beginners, offer
distinct interfaces (functionality)– Advantage: can make assumptions on users– Disadvantage: habit formation, cognitive shift
• Uni-modal interface– Make essential functionality obvious– Make advanced functionality accessible
www.studentsfocus.com
![Page 89: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/89.jpg)
Search strategies
• Web directories• Query-based searching• Link-based browsing (provided by the browser,
not the IRS)• “More like this”• Known site (bookmarking)
• A combination of the abovewww.studentsfocus.com
![Page 90: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/90.jpg)
Web IRS evaluation
• Effectiveness - problems– Search for documents vs. information– What is the target collection (the crawled and indexed
Web) today ?– Recall, relative recall, aspectual recall– Levels of relevance, quality, hubs & authorities– User-centered, task-oriented evaluation
• Task completion, user satisfaction
• Usability– Is there anything specific for Web IRSs ?
www.studentsfocus.com
![Page 91: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/91.jpg)
More advanced topics ofIR research
www.studentsfocus.com
![Page 92: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/92.jpg)
Support for Relevance Feedback
• RF can improve search effectiveness … but is rarely used
• Voluntary vs. forced feedback
• At document vs. word level
• “Magic” vs. controlwww.studentsfocus.com
![Page 93: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/93.jpg)
Term clustering
• Based on `similarity’ between terms– Collocation in documents, paragraphs, sentences
• Based on document clustering– Terms specific for bottom-level document clusters
are assumed to represent a topic
• Use– Thesauri– Query expansion
www.studentsfocus.com
![Page 94: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/94.jpg)
User modelling
• Build a model / profile of the user by recording– the `context’– topics of interest– preferences
based on interpreting (his/her actions):– Implicit or explicit relevance feedback– Recommendations from `peers’– Customization of the environment
www.studentsfocus.com
![Page 95: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/95.jpg)
Personalised systems
• Information filtering– Ex: in a TV guide only show programs of interest
• Use user model to disambiguate queries– Query expansion– Update the model continuously
• Customize the functionality and the look-and-feel of the system– Ex: skins; remember the levels of the user
interfacewww.studentsfocus.com
![Page 96: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/96.jpg)
Autonomous agents
• Purpose: find relevant information on behalf of the user
• Input: the user profile• Output: pull vs. push• Positive aspects:
– Can work in the background, implicitly– Can update the master with new, relevant info
• Negative aspects: control
• Integration with collaborative systemswww.studentsfocus.com
![Page 97: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/97.jpg)
Questions ?
www.studentsfocus.com
![Page 98: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/98.jpg)
Information Hierarchy
• Data– The raw material of information
• Information– Data organized or presented in some context
• Knowledge– Information read, heard or seen and understood
• Wisdom– Distilled and integrated knowledge and understanding
www.studentsfocus.com
![Page 99: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/99.jpg)
Meaning vs. Form
• Meaning– Indicates what the document is about, or the topic of the
document– Requires intelligent interpretation by a human or artificial
intelligence techniques
• Form– Refers to the the content per se, i.e. the words that make up
the document
www.studentsfocus.com
![Page 100: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/100.jpg)
Data vs. Information Retrieval
Matching Exact match Partial matchModel Deterministic ProbabilisticClassification Monothetic PolytheticQuery specification Complete IncompleteError response Sensitive Insensitive
(van Rijsbergen, C.J. (1979) http://www.dcs.gla.ac.uk/Keith/Preface.html)www.studentsfocus.com
![Page 101: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/101.jpg)
Relevance
• Depends on the individual and on the context
• Relevance vs. aboutness (for a topic)
• Relevance vs. usefulness (for a task)
• Relevance judgements in test collections– Allow for system evaluation
www.studentsfocus.com
![Page 102: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/102.jpg)
Collection/document structure• Examples
– Cacm– Reuters– Email– Web
• Issues– Identify indexing units / documents
• Mark-up, parsers
– Structured documents - what to index ?– Weighting scheme
www.studentsfocus.com
![Page 103: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/103.jpg)
Lexical analysis
• Break up the text in words or “tokens”• Question: “what is a word” ?
• Problem cases– Numbers: “M16”, “2001”– Hyphenation: “MS-DOS”, “OS/2”– Punctuation: “John’s”, “command.com”– Case: “us”, “US”– Phrases: “venetian blind”
www.studentsfocus.com
![Page 104: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/104.jpg)
Stopwords
• Very frequent words, with no power of discrimination
• Typically function words, not indicative of content
• The stopwords set depends on the document collection and on the application
www.studentsfocus.com
![Page 105: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/105.jpg)
Stemming• Identify morphological variants, creating
“classes”– system, systems– forget, forgetting, forgetful– analyse, analysis, analytical, analysing
• Use in an IR system– Replace each term by the class representative (root
or most common variant)– Replace each word by all the variants in its class
www.studentsfocus.com
![Page 106: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/106.jpg)
Stemming errors• Too aggressive
– organization / organ– police / policy– arm / army– execute / executive
• Too timid– european / europe– create / creation– search / searcher– cylinder / cylindricalwww.studentsfocus.com
![Page 107: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/107.jpg)
B-tree
term 1
term N
term 2
term k
search-index
dictionary postings lists
1 2 116
Inverted files
www.studentsfocus.com
![Page 108: An Introduction computer scientist - StudentsFocusstudentsfocus.com/notes/anna_university/CSE/7SEM/CS6007 - IR/not… · –Real content retrieval of other media (based on shape,](https://reader033.fdocuments.net/reader033/viewer/2022060416/5f13ec6ef7aa7844c62ac1e5/html5/thumbnails/108.jpg)
Inverted files
www.studentsfocus.com