WIRED Week 2
Transcript of WIRED Week 2
![Page 1: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/1.jpg)
WIRED Week 2
• Syllabus Update (at least another week)
• Readings Overview
- Many ways to explain the same things
- Always think of the user
- Skip the math (mostly)
• Readings Review
- Most complicated of the entire semester
- Refer back to it often
• Readings Review - More Models
• Non-Linear
• Projects and/or Papers Overview
![Page 2: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/2.jpg)
Opening Credits
• Material for these slides obtained from:
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
- Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983.
- Ray Mooney, CSE 8335, Spring 2003
- Joydeep Ghosh
![Page 3: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/3.jpg)
Why IR?
• IR originally mostly for systems, not people
• IR in the last 25 years:
1. classification and categorization
2. systems and languages
3. user interfaces and visualization
• A small world of concern
• The Web changed everything
• Huge amount of accessible information
• Varied information sources
• Relatively easy to look for information
• Improving IR means improving learning
• Digital technology changes everything (again)
![Page 4: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/4.jpg)
WIRED Focus
• Information Retrieval: representation, storage, organization of, and access to information items
• Focus is on the user information need
• User information need example:
- Find all docs containing information on Austin which:
• Are hosted by utexas.edu
• Discuss restaurants
• Emphasis is on the retrieval of information (not data, not just a keyword match)
![Page 5: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/5.jpg)
So just what is Information then?
• Oh no, not more about information....
• “The difference that makes a difference” (Gregory Bateson)
• Element in the communications process
- Information Theory
- Data
• Something that informs a user
- Not just data:
• “Orange”
• “1,741,405.339”
- Helps users learn
- Helps users make decisions
- Facts (in context)
![Page 6: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/6.jpg)
Differences
• Data retrieval
- Keyword match for documents
- Well-defined semantics
- A single erroneous object implies failure
• Information retrieval
- Information about a subject or topic
- Loose semantics (language issues)
- Small errors are tolerated
• IR system:
- interprets contents of information items (documents)
- generates a ranking which reflects relevance for the query
- notion of relevance is most important
![Page 7: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/7.jpg)
Let’s Talk about Finding
• How many ways are there to find something?
• Find a specific thing?
• Find a concept?
• Learn about a topic?
![Page 8: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/8.jpg)
Ways of Finding are IR models
• Each way of finding is one type of Information Retrieval
• Different ways to search for a person, place or thing
• Real-life information retrieval combines several of the methods
- All at once
- In succession
![Page 9: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/9.jpg)
User Interaction with (IR) System
• Diagram: the user interacts with the “Database” through two tasks, Retrieval and Browsing
• Users can do 2 distinct tasks
- Which one is more important?
- Can you do both at once?
- Which is more difficult?
• Leverage the strengths of each
![Page 10: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/10.jpg)
Quick Overview of the IR Process
• Diagram: the user’s Information Need becomes a query; Documents are processed into an index; the query is matched against the index to produce a Ranking of documents
![Page 11: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/11.jpg)
How do you get an index?
• Diagram: Docs → structure recognition → full text → accents, spacing, stopwords → noun groups → stemming → index terms (manual indexing can also feed in at each stage)
• A “bag of words” with these logical parts that are processed
• The Web has (some) structure in markup languages
• Follow from left to right to see the document get condensed
• Structure + the right logical parts (+ things added) = Index
![Page 12: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/12.jpg)
Our friend the Index
• IR systems usually rely on an index or indices to process queries
• Index term:
- a keyword or group of selected words
• “Golf” or “Golf Swing”
- any word (not always actually in the document!)
• “Waste of time”
• Stemming might be used:
- connect: connecting, connection, connections
• An inverted file is built for the chosen index terms
- A list of each keyword (or keyword group) and its location in the document(s)
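As a sketch (not from the slides), the inverted file described above can be built in a few lines of Python; the two-document mini-collection is hypothetical:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the list of (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

# Hypothetical mini-collection.
docs = {1: "golf swing basics", 2: "improve your golf swing"}
index = build_inverted_index(docs)
# index["golf"] lists every (document, position) occurrence of "golf"
```

A real system would also apply the stopword removal and stemming mentioned above before terms reach the index.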
![Page 13: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/13.jpg)
Indexes, huh?
• Matching at the index term level is not always the best way to find
• Users get confused and frustrated (without an understanding of the IR model)
• Search features are not easy to use
- Number of search terms is small
- Results not presented well
• Web searching actually not so easy either
- Junk pages
- Ranking games
- Duplicate information
- Bad page or site Information Architecture
![Page 14: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/14.jpg)
More about indexes
• Indexes make sense of the order (to the system)
• Relevance is measured for the user’s query among the indices
• Ranking of results is how the index is exposed to the user
![Page 15: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/15.jpg)
Ranking
• A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
• A ranking is based on fundamental premises regarding the notion of relevance, such as:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance
• Each set of premises leads to a distinct IR model
![Page 16: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/16.jpg)
Relevance
• Relevance means the correct document for your situation
• Relevance feedback is doing the search again with changes to the search terms
- By the user
- By the system
• Also called Query Reformulation
• Relevance is based on the user
• A huge area for improvement in search interfaces
![Page 17: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/17.jpg)
IR Models
• Classic IR Models
- Boolean Model
- Vector Model
- Probabilistic Model
• Set Theoretic Models
- Fuzzy Set Model
- Extended Boolean Model
• Generalized Vector Model
• Latent Semantic Indexing
• Neural Network Model
![Page 18: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/18.jpg)
Types of IR Models
• Diagram: a taxonomy of IR models by user task
- Retrieval (ad hoc, filtering):
• Classic Models: Boolean, Vector, Probabilistic
• Set Theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network
• Structured Models: Non-Overlapping Lists, Proximal Nodes
- Browsing: Flat, Structure Guided, Hypertext
![Page 19: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/19.jpg)
The "Usual Suspects”
• Table: which models fit each logical view of documents, by user task
- Retrieval + Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic
- Retrieval + Full Text: Classic, Set Theoretic, Algebraic, Probabilistic
- Retrieval + Full Text + Structure: Structured
- Browsing + Index Terms: Flat
- Browsing + Full Text: Flat, Hypertext
- Browsing + Full Text + Structure: Structure Guided, Hypertext
![Page 20: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/20.jpg)
Classic IR Models - Basics
• Each document is represented by a set of representative keywords or index terms
• An index term is a word from the document that describes the document
- Classically, index terms are nouns because nouns have meaning by themselves
- Most human indexers use nouns & verbs
- Now, search engines assume that all words are index terms (full text representation)
• A great index has words added to it for additional meaning and context
![Page 21: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/21.jpg)
Classic IR Models – More Basics
• Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
• The importance of the index terms is represented by weights associated with them
- ki - an index term
- dj - a document
- F - the framework for document representations
- R - ranking function in relation to query & document
- wij - a weight associated with (ki,dj)
- The weight wij quantifies the importance of the index term for describing the document contents
![Page 22: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/22.jpg)
Classic IR Models – Basic Variables
- t is the total number of index terms
- K = {k1, k2, …, kt} is the set of all index terms
- wij >= 0 is a weight associated with (ki,dj)
- wij = 0 indicates that term does not belong to doc
- dj= (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
- gi(dj) = wij is a function which returns the weight associated with pair (ki,dj)
![Page 23: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/23.jpg)
Wait a minute!
• Why are these variables so poorly labeled and organized?
• Tradition
• Security through Obscurity?
• Difference from other mathematical symbolism?
• Sadly, not consistent even in the IR literature
![Page 24: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/24.jpg)
Boolean Model
• Simple model based on set theory
• Queries specified as boolean expressions
- precise semantics
- neat formalism
- q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent. Thus, wij ∈ {0,1}
• Consider
- q = ka ∧ (kb ∨ ¬kc)
- qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
- qcc = (1,1,0) is a conjunctive component
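A minimal sketch of Boolean retrieval over the disjunctive normal form above (the term names and documents are illustrative, not from the slides):

```python
def boolean_sim(doc_terms, dnf_components, term_order):
    """Return 1 if the document's binary term vector matches any
    conjunctive component of the query's disjunctive normal form."""
    doc_vec = tuple(1 if t in doc_terms else 0 for t in term_order)
    return 1 if doc_vec in dnf_components else 0

# q = ka AND (kb OR NOT kc), over the term order (ka, kb, kc)
dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
print(boolean_sim({"ka", "kb"}, dnf, ("ka", "kb", "kc")))  # matches (1,1,0) -> 1
print(boolean_sim({"kb", "kc"}, dnf, ("ka", "kb", "kc")))  # ka absent -> 0
```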
![Page 25: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/25.jpg)
Boolean Model
• Venn diagram: the conjunctive components (1,1,1), (1,1,0), and (1,0,0) over the terms Ka, Kb, Kc
• q = ka ∧ (kb ∨ ¬kc)
• sim(q,dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki, gi(vec(dj)) = gi(vec(qcc)))
• sim(q,dj) = 0 otherwise
![Page 26: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/26.jpg)
Boolean Model Problems
• Retrieval is binary: no partial matching
• No ranking of the documents is provided (absence of a grading scale)
• Users aren’t good at boolean queries
- too simplistic
- wrong
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
![Page 27: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/27.jpg)
Vector Model
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• Ranked set of documents provides for better matching
![Page 28: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/28.jpg)
Vector Model
• wij > 0 whenever ki appears in dj
• wiq >= 0 is associated with the pair (ki,q)
• dj = (w1j, w2j, ..., wtj)
• q = (w1q, w2q, ..., wtq)
• To each term ki is associated a unit vector vec(ki)
• The unit vectors vec(ki) and vec(kl) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
• The t unit vectors vec(ki) form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors
![Page 29: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/29.jpg)
Vector Model
• Diagram: the angle θ between the document vector dj and the query vector q
• sim(q,dj) = cos(θ)
- = (vec(dj) • vec(q)) / (|dj| * |q|)
- = (Σi wij * wiq) / (|dj| * |q|)
• Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
• A document is retrieved even if it matches the query terms only partially
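The cosine formula above can be sketched directly in Python (the example vectors are hypothetical 3-term weight vectors):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document and query weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Document and query share only one of three terms, yet still get a score:
print(cosine_sim([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5 — a partial match
```

This is exactly why the vector model retrieves partial matches: any shared weighted term makes the dot product, and hence the score, nonzero.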
![Page 30: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/30.jpg)
Vector Model
• sim(q,dj) = (Σi wij * wiq) / (|dj| * |q|)
• How to compute the weights wij and wiq?
• A good weight must take into account two effects:
- quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity)
• idf factor, the inverse document frequency
- wij = tf(i,j) * idf(i)
![Page 31: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/31.jpg)
Vector Model
• Let
- N be the total number of docs in the collection
- ni be the number of docs which contain ki
- freq(i,j) be the raw frequency of ki within dj
• A normalized tf factor is given by
- f(i,j) = freq(i,j) / max_l(freq(l,j))
- where the maximum is computed over all terms which occur within the document dj
• The idf factor is computed as
- idf(i) = log(N/ni)
- the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
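A short sketch combining the normalized tf and idf factors above (the collection sizes and term counts are hypothetical, and base-10 logs are assumed):

```python
import math

def tf_idf_weights(doc_freqs, N, doc_counts):
    """w_ij = f(i,j) * log10(N / n_i), with f normalized by the max raw frequency."""
    max_freq = max(doc_freqs.values())
    return {t: (f / max_freq) * math.log10(N / doc_counts[t])
            for t, f in doc_freqs.items()}

# Hypothetical collection: 1000 docs; "golf" occurs in 100 of them, "swing" in 500.
w = tf_idf_weights({"golf": 3, "swing": 1}, N=1000, doc_counts={"golf": 100, "swing": 500})
# "golf" is both frequent in the doc and rare in the collection, so it dominates
```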
![Page 32: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/32.jpg)
Vector Model
• The best term-weighting schemes take both into account.
- wij = f(i,j) * log(N/ni)
• This strategy is called a tf-idf weighting scheme
• Term frequency – Inverse document frequency
- How rare is the word in this document in comparison with other documents too?
![Page 33: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/33.jpg)
Vector Model
• For the query term weights, a suggestion is
- wiq = (0.5 + [0.5 * freq(i,q) / max_l(freq(l,q))]) * log(N/ni)
• The vector model with tf-idf weights is a good ranking strategy with general collections
• The vector model is usually as good as any known ranking alternative
• It is also simple and fast to compute
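The suggested query weight can be sketched as follows (base-10 log and the sample numbers are assumptions; the 0.5 offset keeps every query term's tf component at least 0.5):

```python
import math

def query_weight(freq_iq, max_freq_q, N, n_i):
    """Augmented query term weight: (0.5 + 0.5*freq/max_freq) * idf."""
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log10(N / n_i)

# Term occurring once in a one-term query, in 100 of 1000 docs:
print(query_weight(1, 1, 1000, 100))  # 1.0
```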
![Page 34: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/34.jpg)
Weights wij and wiq?
• One approach is to examine the frequency of occurrence of a word in a document:
• Absolute frequency:
- tf factor, the term frequency within a document
- freq(i,j) - raw frequency of ki within dj
- Both high-frequency and low-frequency terms may not actually be significant
• Relative frequency: tf divided by the number of words in the document
• Normalized frequency:
- f(i,j) = freq(i,j) / max_l(freq(l,j))
![Page 35: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/35.jpg)
Inverse Document Frequency
• Importance of a term may depend more on how it can distinguish between documents
• Quantification of inter-document separation
• Dissimilarity, not similarity
• idf factor, the inverse document frequency
![Page 36: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/36.jpg)
IDF
• N = the total number of docs in the collection
• ni = the number of docs which contain ki
• The idf factor is computed as
- idf(i) = log (N/ni)
- the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
• For example (log base 10):
- N = 1000, n1 = 100, n2 = 500, n3 = 800
- idf1 = 3 - 2 = 1
- idf2 = 3 - 2.7 = 0.3
- idf3 = 3 - 2.9 = 0.1
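The slide's three idf values can be checked in two lines of Python (assuming, as the arithmetic implies, base-10 logs):

```python
import math

N = 1000
for i, n_i in enumerate([100, 500, 800], start=1):
    # Rarer terms (smaller n_i) get larger idf and thus more weight.
    print(f"idf{i} = {math.log10(N / n_i):.1f}")
```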
![Page 37: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/37.jpg)
Vector Model Considered
• Most common
• Advantages:
- term-weighting improves quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- cosine ranking formula sorts documents according to degree of similarity to the query
- computationally efficient
• Disadvantages:
- assumes independence of index terms - not clear that this is bad though
- not best if applied exclusively
![Page 38: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/38.jpg)
Probabilistic Model
• Uses a probabilistic framework to get a result
- Given a user query, there is an ideal answer set
- Querying is specification of the properties of this ideal answer set (clustering)
• But, what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve probability by iteration & additional data
![Page 39: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/39.jpg)
Probabilistic Model Iterated
• An initial set of documents is retrieved
• User inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
• IR system uses this information to refine the description of the ideal answer set
• By repeating this process, it is expected that the description of the ideal answer set will improve
• Have always in mind the need to guess at the very beginning the description of the ideal answer set
• Description of the ideal answer set is modeled in probabilistic terms
![Page 40: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/40.jpg)
Probabilistic Ranking Principle
• Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).
• The model assumes that this probability of relevance depends on the query and the document representations only. The ideal answer set is referred to as R and should maximize the probability of relevance.
• Documents in the set R are predicted to be relevant.
• How to compute probabilities?
• What is the sample space?
![Page 41: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/41.jpg)
Probabilistic Ranking
• Probabilistic ranking computed as:
- sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
- This is the odds of the document dj being relevant
- Taking the odds minimizes the probability of an erroneous judgement
• Definitions:
- wij ∈ {0,1}
- P(R | vec(dj)): probability that the given doc is relevant
- P(¬R | vec(dj)): probability that the doc is not relevant
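A sketch of ranking by odds of relevance (the per-document probabilities here are made up; in a real system they come from the iterative estimation described above):

```python
def odds_ratio(p_rel):
    """Rank score as the odds of relevance: P(R|d) / (1 - P(R|d))."""
    return p_rel / (1.0 - p_rel)

# Hypothetical estimated probabilities of relevance per document.
docs = {"d1": 0.8, "d2": 0.5}
ranked = sorted(docs, key=lambda d: odds_ratio(docs[d]), reverse=True)
# ranked puts d1 first: higher probability of relevance -> higher odds
```

Because odds are a monotone function of probability, sorting by odds gives the same order as sorting by P(R|d) itself.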
![Page 42: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/42.jpg)
Prob Ranking Pros & Cons
• Advantages:
- Docs ranked in decreasing order of probability of relevance
• Disadvantages:
- need to guess initial estimates for P(ki | R)
- method does not take into account tf and idf factors
![Page 43: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/43.jpg)
Models Compared
• Boolean model does not provide for partial matches and is considered to be the weakest classic model
• Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
• This also seems to be the view of the research community
![Page 44: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/44.jpg)
Break!
• Fuzzy Models
• Extended Models
• Generalized Models
• Advanced Models
![Page 45: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/45.jpg)
Set Theoretic Models
• The Boolean model imposes a binary criterion for deciding relevance
• How about something a little more complex?
• Extending the Boolean model with partial matching and a ranking is the goal
• Two set theoretic models
- Fuzzy Set Model
- Extended Boolean Model
![Page 46: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/46.jpg)
Fuzzy Set Model
• Queries and docs represented by sets of index terms: matching is approximate from the start
• This vagueness can be modeled using a fuzzy framework, as follows:
- with each term is associated a fuzzy set
- each doc has a degree of membership in this fuzzy set
• This interpretation provides the foundation for many IR models based on fuzzy theory
• Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
![Page 47: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/47.jpg)
Fuzzy Set Theory
• Framework for representing classes whose boundaries are not well defined
• Key idea is to introduce the notion of a degree of membership associated with the elements of a set
• This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
• Thus, membership is now a gradual notion, contrary to the crisp notion enforced by classic Boolean logic
![Page 48: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/48.jpg)
Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• This thesaurus is built as follows:
- Let vec(c) be a term-term correlation matrix
- Let c(i,l) be a normalized correlation factor for (ki,kl):
- c(i,l) = n(i,l) / (ni + nl - n(i,l))
- ni: number of docs which contain ki
- nl: number of docs which contain kl
- n(i,l): number of docs which contain both ki and kl
• This allows for proximity among index terms.
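The normalized correlation factor above (a Jaccard-style overlap of the two terms' document sets) is a one-liner; the document counts in the example are hypothetical:

```python
def correlation(n_i, n_l, n_il):
    """Normalized term-term correlation c(i,l) = n(i,l) / (n_i + n_l - n(i,l))."""
    return n_il / (n_i + n_l - n_il)

print(correlation(10, 10, 10))  # identical occurrence sets -> 1.0
print(correlation(10, 10, 0))   # terms never co-occur -> 0.0
```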
![Page 49: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/49.jpg)
Fuzzy Model Issues
• The correlation factor c(i,l) can be used to define fuzzy set membership for a document dj
• Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
• Experiments with standard test collections are not available
• Difficult to compare at this time
![Page 50: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/50.jpg)
Extended Boolean Model
• Boolean model is simple and elegant.
• But, no provision for a ranking
• As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
• Extend the Boolean model with the notions of partial matching and term weighting
• Combine characteristics of the Vector model with properties of Boolean algebra
![Page 51: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/51.jpg)
Latent Semantic Indexing (LSI)
• Classic IR might lead to poor retrieval because:
- unrelated documents might be included in the answer set
- relevant documents that do not contain at least one index term are not retrieved
- Reasoning: retrieval based on index terms is vague and noisy
• The user information need is more related to concepts and ideas than to index terms
• A document that shares concepts with another document known to be relevant might be of interest
![Page 52: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/52.jpg)
Latent Semantic Indexing Issues
• Mapping documents and queries into a lower dimensional space (i.e., composed of higher level concepts which are fewer in number than the index terms) reduces complexity
• Retrieval in this reduced concept space might be superior to retrieval in the space of index terms
• It allows reducing the complexity of the representational framework, which might be explored, for instance, with the purpose of interfacing with the user
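The dimensionality reduction behind LSI is a truncated singular value decomposition of the term-document matrix. A sketch with NumPy (the 5-term by 4-document count matrix is invented for illustration):

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
    [1, 0, 0, 1],
], dtype=float)

# Truncated SVD: keep the k largest singular values as "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of A
doc_concepts = np.diag(s[:k]) @ Vt[:k, :]     # k x 4: each document in concept space
# Queries are folded into the same k-dimensional space and compared there
```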
![Page 53: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/53.jpg)
Neural Network Model
• Neural Networks:
- The human brain is composed of billions of neurons
- Each neuron can be viewed as a small processing unit
- A neuron is stimulated by input signals and emits output signals in reaction
- A chain reaction of propagating signals is called a spread activation process
- As a result of spread activation, the brain might command the body to take physical reactions
![Page 54: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/54.jpg)
Neural Network Model
• A neural network is an oversimplified representation of the neuron interconnections in the human brain:
- nodes are processing units
- edges are synaptic connections
- the strength of a propagating signal is modelled by a weight assigned to each edge
- the state of a node is defined by its activation level
- depending on its activation level, a node might issue an output signal
• Neural Nets are good at recognizing patterns
![Page 55: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/55.jpg)
Neural Network Issues
• Very difficult to test (and understand)
• Training and working document set differences
• Has not been tested extensively
• May only be good for selected cases
• Improvement over traditional models not consistently proven
![Page 56: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/56.jpg)
Alternative Probabilistic Models
• Probability Theory
- Semantically clear
- Computationally clumsy
• Why Bayesian Networks?
- Clear formalism to combine evidence
- Modularize the world (dependencies)
• Bayesian Network Models for IR
- Inference Network (Turtle & Croft, 1991)
- Belief Network (Ribeiro-Neto & Muntz, 1996)
![Page 57: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/57.jpg)
Inference Network Model
• Epistemological view of the IR problem
• Random variables associated with documents, index terms and queries
• A random variable associated with a document dj represents the event of observing that document
• The prior probability P(dj) reflects the probability associated with the event of observing a given document dj
![Page 58: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/58.jpg)
Belief Network Model
• Similar to the Inference Network Model
- Epistemological view of the IR problem
- Random variables associated with documents, index terms and queries
• Contrary to the Inference Network Model
- Clearly defined sample space
- Set-theoretic view
- Different network topology
![Page 59: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/59.jpg)
Bayesian Inference Models
• Probability based
- Frequency
- Empirical
- Chain of conditions (parents)
• In a Bayesian network each variable is conditionally independent of all its non-descendants, given its parents.
• Dynamic data difficulties
• Broad collections of data applicable
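The conditional-independence property above is what makes a joint probability factor into small per-variable terms. A toy sketch (the tiny network and its probabilities are invented, not from the slides):

```python
# Toy network k -> d, k -> q: given its parent k, each of d and q is
# independent of the other, so the joint factorizes as
#   P(k, d, q) = P(k) * P(d | k) * P(q | k)
def joint(p_k, p_d_given_k, p_q_given_k):
    """Joint probability of one assignment under the chain-rule factorization."""
    return p_k * p_d_given_k * p_q_given_k

print(joint(0.5, 0.8, 0.6))  # three small factors instead of one 8-entry table
```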
![Page 60: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/60.jpg)
Model Comparisons
• Inference Network model is the first and most well known
• Belief Network model
- adopts a set-theoretic view
- adopts a clearly defined sample space
- provides a separation between query and document portions
- is able to reproduce any ranking produced by the Inference Network while the converse is not true (for example: the ranking of the standard vector model)
![Page 61: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/61.jpg)
Model Comparisons
• Computational costs
- The Inference Network Model processes one document node at a time, so it is linear in the number of documents
- In the Belief Network, only the states that activate each query term are considered
- The networks do not impose additional costs because they do not include cycles
• The major strength is the combination of distinct evidential sources to support the rank of a given document
![Page 62: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/62.jpg)
Structured Models
• Traditional models are (mostly) keyword-based
• They consider the documents flat, i.e., a word in the title has the same weight as a word in the body of the document
• Document structure is one additional piece of information which can be used
- Words appearing in the title or in sub-titles within the document
- Structured markup
![Page 63: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/63.jpg)
What Could Be Improved?
• Advanced interfaces that facilitate the specification of the structure are also highly desirable
• Hybrid models
- Combining models
- Combining views of data
- Complex, phased processing of text
• Structured text models should affect ranking
• Metadata should have an impact
![Page 64: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/64.jpg)
Projects and/or Papers Overview
• How can (Web) IR be better?
- Better IR models
- Better user interfaces
• More to find vs. easier to find
• Scriptable applications
• New interfaces for applications
• New datasets for applications
![Page 65: WIRED Week 2](https://reader036.fdocuments.net/reader036/viewer/2022062800/56813f57550346895daa2284/html5/thumbnails/65.jpg)
Project Idea #1 – simple HTML
• Graphical Google
• What kind of document?
• When was the document created?