Artificial Intelligence
Knowledge Representation
Vector Model
Andrea Torsello
What is Data?

● Data is a collection of data objects and their attributes
● An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature of a room, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
● A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

(Columns are attributes; rows are objects.)
Attribute Values
● Attribute values are numbers or symbols assigned to an attribute
● There is a distinction between attributes and attribute values
  ● The same attribute can be mapped to different attribute values
    – Depends on the measurement scale
    – Example: height can be measured in feet or meters
  ● Different attributes can be mapped to the same set of values
    – Example: attribute values for ID and age are both integers
    – But the properties of the attribute values can differ
      ● ID has no limit, but age has a minimum and a maximum value
Features, patterns and classifiers
● A feature is any distinctive aspect, quality, or characteristic
● Features may be symbolic (e.g., color) or numeric (e.g., height)
● The combination of d features is represented as a d-dimensional column vector called a feature vector
● The d-dimensional space defined by the feature vector is called the feature space
● Objects are represented as points in feature space; a plot of these points is called a scatter plot (see the sketch below)
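As an illustration, here is a minimal Python sketch of objects as points in feature space; the feature names and values are hypothetical, not from the slides:

```python
import numpy as np

# Each object is described by d = 2 features: (height in cm, weight in kg).
# Names and values are illustrative only.
objects = np.array([
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
])

# Each row is a feature vector, i.e., a point in the 2-D feature space.
print(objects.shape)  # (3, 2): 3 objects, 2 features each
```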
Features, patterns and classifiers
● A pattern is a composite of traits or features characteristic of an individual
● In classification, a pattern is a pair of variables {x, ω} where
  ● x is a collection of observations or features (a feature vector)
  ● ω is the concept behind the observation (the label)
● What makes a “good” feature vector?
  ● The quality of a feature vector is related to its ability to discriminate examples from different classes
    – Examples from the same class should have similar feature values
    – Examples from different classes should have different feature values
Features, patterns and classifiers
● More feature properties
● Classifiers
  ● The goal of a classifier is to partition feature space into class-labeled decision regions
  ● Borders between decision regions are called decision boundaries (a minimal example follows below)
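As a hedged illustration of decision regions, here is a minimal nearest-centroid classifier in Python (the data and class labels are hypothetical); every point is assigned the label of the nearest class centroid, and the set of points mapped to one label is that class's decision region:

```python
import numpy as np

# Hypothetical training examples: feature vectors and class labels.
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y = np.array(["A", "A", "B", "B"])

# One centroid per class.
centroids = {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def classify(x):
    """Assign x to the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

print(classify(np.array([1.1, 0.9])))  # "A"
print(classify(np.array([3.9, 4.1])))  # "B"
# For two classes, the decision boundary is the perpendicular
# bisector of the segment joining the two centroids.
```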
Components of a Pattern Recognition System
● A typical pattern recognition system contains
  ● A sensor
  ● A preprocessing mechanism
  ● A feature extraction mechanism (manual or automated)
  ● A classification or description algorithm
  ● A set of examples (training set) already classified or described
Can I eat this mushroom?
I don’t know what type it is – I’ve never seen it before. Is it edible or poisonous?
Can I eat this mushroom?

● Suppose we’re given examples of edible and poisonous mushrooms
● Can we learn a model that can be used to classify other mushrooms?

[Figure: example mushrooms labeled “edible” and “poisonous”]
Representing Instances using Feature Vectors
● We need some way to represent each instance
● One common way to do this: use a fixed-length vector to
  ● represent the features (a.k.a. attributes) of each instance
  ● also represent the class label of each instance
Standard Feature Types
● nominal (including Boolean)
  ● no ordering among possible values
    e.g. color ∈ {red, blue, green} (contrast with a numeric value such as frequency = 1000 Hertz)
● linear (or ordinal)
  ● possible values of the feature are totally ordered
    e.g. size ∈ {small, medium, large} ← discrete
         weight ∈ [0…500] ← continuous
● hierarchical
  ● possible values are partially ordered in an ISA hierarchy (see the sketch after this list)
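A minimal Python sketch of an instance with one feature of each type; the feature names, the size ordering, and the ISA hierarchy are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    color: str      # nominal: no ordering among {"red", "blue", "green"}
    size: str       # ordinal: "small" < "medium" < "large"
    weight: float   # continuous: totally ordered on [0, 500]

SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}

a = Instance(color="red", size="small", weight=120.0)
b = Instance(color="blue", size="large", weight=95.5)

# Ordinal comparison is meaningful for size...
print(SIZE_ORDER[a.size] < SIZE_ORDER[b.size])  # True
# ...but not for color: equality is the only sensible test.
print(a.color == b.color)  # False

# Hierarchical: values partially ordered in an ISA hierarchy (hypothetical).
ISA = {"oak": "tree", "pine": "tree", "tree": "plant"}
```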
Feature Hierarchy
Feature space
● We can think of each instance as representing a point in a d-dimensional feature space, where d is the number of features
Feature Representation

● Another view of the feature-vector representation: a single database table
Feature Vector Representation
● Feature-vector representations arise naturally when characterizing homogeneous entities with well-defined measurable quantities
  ● each datum consists of measurements of well-identified properties
  ● there is no doubt about which measurement is which
  ● measurements can be placed in a vector of
    – fixed dimension
    – known coordinates
Limits of Feature Vector Representation
● Feature-vector models are hard to use when:
  ● It is not obvious how to determine satisfactory features, or the features are inefficient for learning purposes
    – experts cannot define features in a straightforward way
    – features mix numerical and categorical variables
    – data are missing or inhomogeneous
    → Feature Extraction
  ● Symmetries and invariances make it so that the identity of a measurement must be inferred from the context (e.g., image analysis)
    → Contextual Pattern Recognition
  ● Measurables are overabundant, resulting in very high-dimensional data (curse of dimensionality)
    → Dimensionality Reduction
  ● Entities are not homogeneous, vary in complexity, and are described in terms of structural properties, such as parts and relations between parts
    → Structural Pattern Recognition
Not a Universal Representation
Bag of Words Model
Feature extraction
● Feature extraction is concerned with representing each unstructured data element in terms of a record/vector of alphanumeric values, also called features
● This transformation requires a method to extract these features
● Why do we need to represent data as sets of features?
  ● Many information retrieval, data mining, and pattern recognition methods need these representations to apply their algorithms
How do we represent text?
● How do we represent the complexities of language?
  ● Keeping in mind that computers don’t “understand” documents or queries
● A simple, yet effective approach: the “bag of words”
● Treat all the words in a document as index terms for that document
● Assign a “weight” to each term based on its “importance”
● Disregard order, structure, meaning, etc. of the words
Vector Representation
● A “bag of words” can be represented as a vector
  ● Why? Computational efficiency, ease of manipulation
● A vector is a set of values recorded in any consistent order
“The quick brown fox jumped over the lazy dog’s back”
[ 1 1 1 1 1 1 1 1 2 ]
1st position corresponds to “back”
2nd position corresponds to “brown”
3rd position corresponds to “dog”
4th position corresponds to “fox”
5th position corresponds to “jump”
6th position corresponds to “lazy”
7th position corresponds to “over”
8th position corresponds to “quick”
9th position corresponds to “the”
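A minimal Python sketch of this construction; the pre-normalized token list (lowercasing, “jumped” stemmed to “jump”, the possessive dropped from “dog’s”) is written out by hand for clarity:

```python
from collections import Counter

# Tokens of the example sentence after simple normalization.
tokens = ["the", "quick", "brown", "fox", "jump", "over",
          "the", "lazy", "dog", "back"]

counts = Counter(tokens)

# Fix the coordinate order once: the sorted vocabulary.
vocab = sorted(counts)           # ['back', 'brown', 'dog', 'fox', 'jump',
                                 #  'lazy', 'over', 'quick', 'the']
vector = [counts[t] for t in vocab]
print(vector)                    # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```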
Feature Extraction to Represent Text Documents

Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”

Stopword list: the, is, for, to, of

Document vectors, where each feature represents the number of occurrences of a term (stopwords removed; “jumped” stemmed to “jump”):

Term    Doc 1  Doc 2
quick     1      0
brown     1      0
fox       1      0
over      1      0
lazy      1      0
dog       1      0
back      1      0
now       0      1
time      0      1
all       0      1
good      0      1
men       0      1
come      0      1
jump      1      0
aid       0      1
their     0      1
party     0      1
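A sketch of the same pipeline in Python; the stopword list is the one above, and Document 1 is given pre-normalized (“jump”, “dog”), since stemming is not shown here:

```python
from collections import Counter

docs = {
    "Doc 1": "the quick brown fox jump over the lazy dog back",
    "Doc 2": "now is the time for all good men to come to the aid of their party",
}
stopwords = {"the", "is", "for", "to", "of"}

# Term counts per document, with stopwords removed.
counts = {name: Counter(t for t in text.split() if t not in stopwords)
          for name, text in docs.items()}

# A shared vocabulary fixes the coordinates of every document vector.
vocab = sorted(set().union(*counts.values()))
vectors = {name: [c[t] for t in vocab] for name, c in counts.items()}
print(vocab)
print(vectors["Doc 1"])
print(vectors["Doc 2"])
```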
Basic IR Models
● Boolean model
  ● Based on the notion of sets
  ● Documents are retrieved only if they satisfy Boolean conditions specified in the query
  ● Does not impose a ranking on retrieved documents
  ● Exact match
● Vector space model
  ● Based on geometry: the notion of vectors in a high-dimensional space
  ● Documents are ranked based on their similarity to the query (ranked retrieval)
  ● Best/partial match
Boolean Retrieval
● Weights assigned to terms are either “0” or “1”
  ● “0” represents absence: the term isn’t in the document
  ● “1” represents presence: the term is in the document
● Queries are built by combining terms with the Boolean operators AND, OR, NOT
● The system returns all documents that satisfy the query
Why do we say that Boolean retrieval is “set-based”?
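Because each term corresponds to the set of documents that contain it, and the Boolean operators map directly onto set operations. A minimal Python sketch, with hypothetical postings:

```python
# The set of documents containing each term (a toy postings structure).
postings = {
    "good":  {1, 3, 5},
    "party": {2, 3, 5, 8},
    "over":  {3, 7},
}

# Query: good AND party AND NOT over
result = (postings["good"] & postings["party"]) - postings["over"]
print(result)  # {5}
```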
Boolean View of a Collection
[Figure: a term-document incidence matrix for Doc 1 through Doc 8 over the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party; each cell is 1 if the document contains the term, 0 otherwise.]
Each column represents the view of a particular document: what terms are contained in this document? Each row represents the view of a particular term: what documents contain this term?

To execute a query, pick out the rows corresponding to the query terms and then apply the truth table of the corresponding Boolean operator.
Term-document incidence matrices

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1              0          0       0        1
Brutus             1               1              0          1       0        0
Caesar             1               1              0          1       1        1
Calpurnia          0               1              0          0       0        0
Cleopatra          1               0              0          0       0        0
mercy              1               0              1          1       1        1
worser             1               0              1          1       1        0

Each cell is 1 if the play contains the word, 0 otherwise.

Query: Brutus AND Caesar BUT NOT Calpurnia
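The query can be answered directly on the rows of the matrix: take the incidence vectors for Brutus and Caesar, complement the vector for Calpurnia, and AND them bitwise. A short worked example in Python:

```python
brutus    = 0b110100   # rows of the matrix as bit vectors,
caesar    = 0b110111   # one bit per play, in the column order above
calpurnia = 0b010000

answer = brutus & caesar & ~calpurnia & 0b111111  # mask to the 6 plays
print(format(answer, "06b"))  # 100100 -> Antony and Cleopatra, Hamlet
```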
Can’t build the matrix

● A 500K × 1M term-document matrix has half a trillion 0’s and 1’s
● But it has no more than one billion 1’s (the collection contains about one billion word tokens)
  ● the matrix is extremely sparse
● What’s a better representation?
  ● We record only the positions of the 1’s
Inverted index
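The standard sparse representation is the inverted index: for each term, store the sorted list of IDs of the documents that contain it. A minimal Python sketch with hypothetical documents:

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "now is the time",
    3: "the lazy dog",
}

# Build the inverted index: term -> sorted list of doc IDs (a postings list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)
postings = {term: sorted(ids) for term, ids in index.items()}

print(postings["the"])   # [1, 2, 3]
print(postings["lazy"])  # [3]
```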
Why Boolean Retrieval Works
● Boolean operators approximate natural language
  ● Find documents about a good party that is not over
● AND can discover relationships between concepts
  ● good party
● OR can discover alternate terminology
  ● excellent party, wild party, etc.
● NOT can discover alternate meanings
  ● Democratic party
See http://sydney.edu.au/engineering/it/~matty/Shakespeare/test.html for a search engine on Shakespeare exploiting the Boolean model.
See http://www.perunaenciclopediadantescadigitale.eu:8080/dantesearch/ for a search engine on Dante exploiting the Boolean model.
Why Boolean Retrieval Fails
● Natural language is way more complex
● AND “discovers” nonexistent relationships
  ● Terms in different sentences, paragraphs, …
● Guessing terminology for OR is hard
  ● good, nice, excellent, outstanding, awesome, …
● Guessing terms to exclude is even harder!
  ● Democratic party, party to a lawsuit, …
Strengths and Weaknesses
Strengths
● Precise, if you know the right strategies
● Precise, if you have an idea of what you’re looking for
● Efficient for the computer

Weaknesses
● Users must learn Boolean logic
● Boolean logic is insufficient to capture the richness of language
● No control over the size of the result set: either too many documents or none
● All documents in the result set are considered “equally good”
● Does not fit huge collections
● No support for partial matches
Similarity-Based Queries
● Rank documents by their similarity to the query
  ● Treat the query as if it were a document
    – Free-text queries: rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
    – Score its similarity to each document in the collection
    – Rank the documents by similarity score
  ● Documents need not contain all the query terms
    – Although documents with more query terms should be “better”
Documents as vectors
● We need a way of assigning a score to a query/document pair
  ● Let’s start with a one-term query
    – If the query term does not occur in the document, the score should be 0
    – The more frequent the query term in the document, the higher the score should be
  ● There are a large number of alternatives for this
● So we have a vector space with |V| dimensions
  ● Terms are the axes of the space
  ● Documents (and queries) are points, or vectors, in this space
  ● Very high-dimensional: tens of millions of dimensions for a web search engine
  ● These are very sparse vectors: most entries are zero (see the sketch below)
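Sparse vectors are typically stored as term-to-count mappings rather than dense arrays. A minimal sketch, scoring a one-term query by raw term frequency (one of the simple alternatives mentioned above); the document vector is hypothetical:

```python
# Sparse document vector: only nonzero coordinates are stored.
doc = {"quick": 1, "brown": 1, "fox": 2, "lazy": 1}

def score_one_term(query_term, doc_vector):
    """Raw term-frequency score: 0 if the term is absent."""
    return doc_vector.get(query_term, 0)

print(score_one_term("fox", doc))    # 2
print(score_one_term("party", doc))  # 0
```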
Why distance is a bad idea

The Euclidean distance between q and d2 can be large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance

● Thought experiment: take a document d and append it to itself; call this document d′
  ● “Semantically” d and d′ have the same content
  ● The Euclidean distance between the two documents can be quite large
  ● The angle between the two documents is 0, corresponding to maximal similarity
● Key idea: measure the similarity between documents as the angle between their vector representations (a numerical check follows below)
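A quick numerical check of the thought experiment in Python; the term-count vector is hypothetical:

```python
import numpy as np

d = np.array([3.0, 1.0, 0.0, 2.0])  # hypothetical term-count vector
d_prime = 2 * d                      # d appended to itself: every count doubles

euclidean = np.linalg.norm(d - d_prime)
cosine = d @ d_prime / (np.linalg.norm(d) * np.linalg.norm(d_prime))

print(euclidean)  # large: equals ||d|| = 3.74...
print(cosine)     # 1.0: angle 0, maximal similarity
```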
Vector Space Model
Postulate: Documents that are “close together” in vector space “talk about” the same things
[Figure: documents d1 … d5 and query Q as vectors in a space with term axes t1, t2, t3; θ and φ are angles between the query vector and document vectors.]

Therefore, retrieve documents based on how close the document is to the query (e.g., similarity ~ cosine of the angle)
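A minimal cosine-ranking sketch in Python; the term-count vectors over axes (t1, t2, t3) are hypothetical:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical count vectors over the terms (t1, t2, t3).
docs = {
    "d1": np.array([2.0, 0.0, 1.0]),
    "d2": np.array([0.0, 3.0, 1.0]),
    "d3": np.array([1.0, 1.0, 0.0]),
}
q = np.array([1.0, 0.0, 1.0])

# Rank documents by cosine similarity to the query.
for name, d in sorted(docs.items(), key=lambda kv: -cos_sim(q, kv[1])):
    print(name, round(cos_sim(q, d), 3))  # d1 first, then d3, then d2
```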
How do we weight doc terms in the vectors?
● Here’s the intuition:
  ● Terms that appear often in a document should get high weights
    – The more often a document contains the term “dog”, the more likely the document is “about” dogs
  ● Terms that appear in many documents should get low weights
    – Words like “the”, “a”, “of” appear in (nearly) all documents
● How do we capture this mathematically?
  ● Term frequency
  ● Inverse document frequency
TF-IDF
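A common way to combine the two quantities above (a sketch, since the slides stop at the heading) is to multiply term frequency by the log of the inverse document frequency, w(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. In Python, with a hypothetical collection:

```python
import math

# Hypothetical collection of tokenized documents.
docs = [
    ["quick", "brown", "fox"],
    ["lazy", "brown", "dog"],
    ["quick", "quick", "fox"],
]
N = len(docs)

def tf_idf(term, doc):
    """tf-idf weight: term frequency times log inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(N / df) if df else 0.0

print(tf_idf("quick", docs[2]))  # higher: frequent in doc, in only 2 of 3 docs
print(tf_idf("brown", docs[0]))  # lower: tf is 1, also in 2 of 3 docs
```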