Artificial Intelligence
Knowledge Representation
Vector Model
Andrea Torsello
What is Data?

● Data is a collection of data objects and their attributes
● An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature of a room, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
● A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

(Columns are attributes; rows are objects.)
Attribute Values
● Attribute values are numbers or symbols assigned to an attribute
● There is a distinction between attributes and attribute values
  ● The same attribute can be mapped to different attribute values
    – Depends on the measurement scale
    – Example: height can be measured in feet or meters
  ● Different attributes can be mapped to the same set of values
    – Example: attribute values for ID and age are both integers
    – But the properties of the attribute values can differ
      ● ID has no limit, but age has a minimum and a maximum value
Features, patterns and classifiers
● A feature is any distinctive aspect, quality, or characteristic
● Features may be symbolic (e.g., color) or numeric (e.g., height)
● The combination of d features is represented as a d-dimensional column vector called a feature vector
● The d-dimensional space defined by the feature vector is called the feature space
● Objects are represented as points in feature space; a plot of these points is called a scatter plot (see the sketch below)
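As an illustration, here is a minimal Python sketch of objects as points in feature space; the feature names and values are hypothetical, not from the slides:

```python
import numpy as np

# Each object is described by d = 2 features: (height in cm, weight in kg).
# Names and values are illustrative only.
objects = np.array([
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
])

# Each row is a feature vector, i.e., a point in the 2-D feature space.
print(objects.shape)  # (3, 2): 3 objects, 2 features each
```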
Features, patterns and classifiers
● A pattern is a composite of traits or features characteristic of an individual
● In classification, a pattern is a pair of variables {x, ω} where
  ● x is a collection of observations or features (a feature vector)
  ● ω is the concept behind the observation (the label)
● What makes a “good” feature vector?
  ● The quality of a feature vector is related to its ability to discriminate examples from different classes
    – Examples from the same class should have similar feature values
    – Examples from different classes should have different feature values
Features, patterns and classifiers
● More feature properties
● Classifiers
  ● The goal of a classifier is to partition feature space into class-labeled decision regions
  ● Borders between decision regions are called decision boundaries (a minimal example follows below)
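As a hedged illustration of decision regions, here is a minimal nearest-centroid classifier in Python (the data and class labels are hypothetical); every point is assigned the label of the nearest class centroid, and the set of points mapped to one label is that class's decision region:

```python
import numpy as np

# Hypothetical training examples: feature vectors and class labels.
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y = np.array(["A", "A", "B", "B"])

# One centroid per class.
centroids = {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def classify(x):
    """Assign x to the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

print(classify(np.array([1.1, 0.9])))  # "A"
print(classify(np.array([3.9, 4.1])))  # "B"
# For two classes, the decision boundary is the perpendicular
# bisector of the segment joining the two centroids.
```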
Components of a Pattern Recognition System
● A typical pattern recognition system contains
  ● A sensor
  ● A preprocessing mechanism
  ● A feature extraction mechanism (manual or automated)
  ● A classification or description algorithm
  ● A set of examples (training set) already classified or described
Can I eat this mushroom?
I don’t know what type it is – I’ve never seen it before. Is it edible or poisonous?
Can I eat this mushroom?

● Suppose we’re given examples of edible and poisonous mushrooms
● Can we learn a model that can be used to classify other mushrooms?

[Figure: example mushrooms labeled “edible” and “poisonous”]
Representing Instances using Feature Vectors
● We need some way to represent each instance
● One common way to do this: use a fixed-length vector to
  ● represent the features (a.k.a. attributes) of each instance
  ● also represent the class label of each instance
Standard Feature Types
● nominal (including Boolean)
  ● no ordering among possible values
    e.g. color ∈ {red, blue, green} (contrast with a numeric value such as frequency = 1000 Hertz)
● linear (or ordinal)
  ● possible values of the feature are totally ordered
    e.g. size ∈ {small, medium, large} ← discrete
         weight ∈ [0…500] ← continuous
● hierarchical
  ● possible values are partially ordered in an ISA hierarchy (see the sketch after this list)
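A minimal Python sketch of an instance with one feature of each type; the feature names, the size ordering, and the ISA hierarchy are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    color: str      # nominal: no ordering among {"red", "blue", "green"}
    size: str       # ordinal: "small" < "medium" < "large"
    weight: float   # continuous: totally ordered on [0, 500]

SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}

a = Instance(color="red", size="small", weight=120.0)
b = Instance(color="blue", size="large", weight=95.5)

# Ordinal comparison is meaningful for size...
print(SIZE_ORDER[a.size] < SIZE_ORDER[b.size])  # True
# ...but not for color: equality is the only sensible test.
print(a.color == b.color)  # False

# Hierarchical: values partially ordered in an ISA hierarchy (hypothetical).
ISA = {"oak": "tree", "pine": "tree", "tree": "plant"}
```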
Feature Hierarchy
Feature space
● We can think of each instance as representing a point in a d-dimensional feature space, where d is the number of features
Feature Representation

● Another view of the feature-vector representation: a single database table
Feature Vector Representation
● Feature-vector representations arise naturally when characterizing homogeneous entities with well-defined measurable quantities
  ● each datum consists of measurements of well-identified properties
  ● there is no doubt about which measurement is which
  ● measurements can be placed in a vector of
    – fixed dimension
    – known coordinates
Limits of Feature Vector Representation
● Feature-vector models are hard to use when:
  ● It is not obvious how to determine satisfactory features, or the features are inefficient for learning purposes
    – experts cannot define features in a straightforward way
    – features mix numerical and categorical variables
    – data are missing or inhomogeneous
    → Feature Extraction
  ● Symmetries and invariances make it so that the identity of a measurement must be inferred from the context (e.g., image analysis)
    → Contextual Pattern Recognition
  ● Measurables are overabundant, resulting in very high-dimensional data (curse of dimensionality)
    → Dimensionality Reduction
  ● Entities are not homogeneous, vary in complexity, and are described in terms of structural properties, such as parts and relations between parts
    → Structural Pattern Recognition
Not a Universal Representation
Bag of Words Model
Feature extraction
● Feature extraction is concerned with representing each unstructured data element in terms of a record/vector of alphanumeric values, also called features
● This transformation requires a method to extract these features
● Why do we need to represent data as sets of features?
  ● Many information retrieval, data mining, and pattern recognition methods need these representations to apply their algorithms
How do we represent text?
● How do we represent the complexities of language?
  ● Keeping in mind that computers don’t “understand” documents or queries
● A simple, yet effective approach: the “bag of words”
● Treat all the words in a document as index terms for that document
● Assign a “weight” to each term based on its “importance”
● Disregard order, structure, meaning, etc. of the words
Vector Representation
● A “bag of words” can be represented as a vector
  ● Why? Computational efficiency, ease of manipulation
● A vector is a set of values recorded in any consistent order
“The quick brown fox jumped over the lazy dog’s back”
[ 1 1 1 1 1 1 1 1 2 ]
1st position corresponds to “back”
2nd position corresponds to “brown”
3rd position corresponds to “dog”
4th position corresponds to “fox”
5th position corresponds to “jump”
6th position corresponds to “lazy”
7th position corresponds to “over”
8th position corresponds to “quick”
9th position corresponds to “the”
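A minimal Python sketch of this construction; the pre-normalized token list (lowercasing, “jumped” stemmed to “jump”, the possessive dropped from “dog’s”) is written out by hand for clarity:

```python
from collections import Counter

# Tokens of the example sentence after simple normalization.
tokens = ["the", "quick", "brown", "fox", "jump", "over",
          "the", "lazy", "dog", "back"]

counts = Counter(tokens)

# Fix the coordinate order once: the sorted vocabulary.
vocab = sorted(counts)           # ['back', 'brown', 'dog', 'fox', 'jump',
                                 #  'lazy', 'over', 'quick', 'the']
vector = [counts[t] for t in vocab]
print(vector)                    # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```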
Feature Extraction to Represent Text Documents

Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”

Stopword list: the, is, for, to, of

Document vectors, where each feature represents the number of occurrences of a term (stopwords removed; “jumped” stemmed to “jump”):

Term    Doc 1  Doc 2
quick     1      0
brown     1      0
fox       1      0
over      1      0
lazy      1      0
dog       1      0
back      1      0
now       0      1
time      0      1
all       0      1
good      0      1
men       0      1
come      0      1
jump      1      0
aid       0      1
their     0      1
party     0      1
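A sketch of the same pipeline in Python; the stopword list is the one above, and Document 1 is given pre-normalized (“jump”, “dog”), since stemming is not shown here:

```python
from collections import Counter

docs = {
    "Doc 1": "the quick brown fox jump over the lazy dog back",
    "Doc 2": "now is the time for all good men to come to the aid of their party",
}
stopwords = {"the", "is", "for", "to", "of"}

# Term counts per document, with stopwords removed.
counts = {name: Counter(t for t in text.split() if t not in stopwords)
          for name, text in docs.items()}

# A shared vocabulary fixes the coordinates of every document vector.
vocab = sorted(set().union(*counts.values()))
vectors = {name: [c[t] for t in vocab] for name, c in counts.items()}
print(vocab)
print(vectors["Doc 1"])
print(vectors["Doc 2"])
```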
Basic IR Models
● Boolean model
  ● Based on the notion of sets
  ● Documents are retrieved only if they satisfy Boolean conditions specified in the query
  ● Does not impose a ranking on retrieved documents
  ● Exact match
● Vector space model
  ● Based on geometry: the notion of vectors in a high-dimensional space
  ● Documents are ranked based on their similarity to the query (ranked retrieval)
  ● Best/partial match
Boolean Retrieval
● Weights assigned to terms are either “0” or “1”
  ● “0” represents absence: the term isn’t in the document
  ● “1” represents presence: the term is in the document
● Queries are built by combining terms with the Boolean operators AND, OR, NOT
● The system returns all documents that satisfy the query
Why do we say that Boolean retrieval is “set-based”?
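Because each term corresponds to the set of documents that contain it, and the Boolean operators map directly onto set operations. A minimal Python sketch, with hypothetical postings:

```python
# The set of documents containing each term (a toy postings structure).
postings = {
    "good":  {1, 3, 5},
    "party": {2, 3, 5, 8},
    "over":  {3, 7},
}

# Query: good AND party AND NOT over
result = (postings["good"] & postings["party"]) - postings["over"]
print(result)  # {5}
```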
Boolean View of a Collection
[Figure: a term-document incidence matrix for Doc 1 through Doc 8 over the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party; each cell is 1 if the document contains the term, 0 otherwise.]
Each column represents the view of a particular document: what terms are contained in this document? Each row represents the view of a particular term: what documents contain this term?

To execute a query, pick out the rows corresponding to the query terms and then apply the truth table of the corresponding Boolean operator.
Term-document incidence matrices

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1              0          0       0        1
Brutus             1               1              0          1       0        0
Caesar             1               1              0          1       1        1
Calpurnia          0               1              0          0       0        0
Cleopatra          1               0              0          0       0        0
mercy              1               0              1          1       1        1
worser             1               0              1          1       1        0

Each cell is 1 if the play contains the word, 0 otherwise.

Query: Brutus AND Caesar BUT NOT Calpurnia
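The query can be answered directly on the rows of the matrix: take the incidence vectors for Brutus and Caesar, complement the vector for Calpurnia, and AND them bitwise. A short worked example in Python:

```python
brutus    = 0b110100   # rows of the matrix as bit vectors,
caesar    = 0b110111   # one bit per play, in the column order above
calpurnia = 0b010000

answer = brutus & caesar & ~calpurnia & 0b111111  # mask to the 6 plays
print(format(answer, "06b"))  # 100100 -> Antony and Cleopatra, Hamlet
```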
Can’t build the matrix

● A 500K × 1M term-document matrix has half a trillion 0’s and 1’s
● But it has no more than one billion 1’s (the collection contains about one billion word tokens)
  ● the matrix is extremely sparse
● What’s a better representation?
  ● We record only the positions of the 1’s
Inverted index
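The standard sparse representation is the inverted index: for each term, store the sorted list of IDs of the documents that contain it. A minimal Python sketch with hypothetical documents:

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "now is the time",
    3: "the lazy dog",
}

# Build the inverted index: term -> sorted list of doc IDs (a postings list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)
postings = {term: sorted(ids) for term, ids in index.items()}

print(postings["the"])   # [1, 2, 3]
print(postings["lazy"])  # [3]
```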
Why Boolean Retrieval Works
● Boolean operators approximate natural language
  ● Find documents about a good party that is not over
● AND can discover relationships between concepts
  ● good party
● OR can discover alternate terminology
  ● excellent party, wild party, etc.
● NOT can discover alternate meanings
  ● Democratic party
See http://sydney.edu.au/engineering/it/~matty/Shakespeare/test.html for a search engine on Shakespeare exploiting the Boolean model.
See http://www.perunaenciclopediadantescadigitale.eu:8080/dantesearch/ for a search engine on Dante exploiting the Boolean model.
Why Boolean Retrieval Fails
● Natural language is way more complex
● AND “discovers” nonexistent relationships
  ● Terms in different sentences, paragraphs, …
● Guessing terminology for OR is hard
  ● good, nice, excellent, outstanding, awesome, …
● Guessing terms to exclude is even harder!
  ● Democratic party, party to a lawsuit, …
Strengths and Weaknesses
Strengths
● Precise, if you know the right strategies
● Precise, if you have an idea of what you’re looking for
● Efficient for the computer

Weaknesses
● Users must learn Boolean logic
● Boolean logic is insufficient to capture the richness of language
● No control over the size of the result set: either too many documents or none
● All documents in the result set are considered “equally good”
● Does not fit huge collections
● No support for partial matches
Similarity-Based Queries
● Rank documents by their similarity to the query
  ● Treat the query as if it were a document
    – Free-text queries: rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
    – Score its similarity to each document in the collection
    – Rank the documents by similarity score
  ● Documents need not contain all the query terms
    – Although documents with more query terms should be “better”
Documents as vectors
● We need a way of assigning a score to a query/document pair
  ● Let’s start with a one-term query
    – If the query term does not occur in the document, the score should be 0
    – The more frequent the query term in the document, the higher the score should be
  ● There are a large number of alternatives for this
● So we have a vector space with |V| dimensions
  ● Terms are the axes of the space
  ● Documents (and queries) are points, or vectors, in this space
  ● Very high-dimensional: tens of millions of dimensions for a web search engine
  ● These are very sparse vectors: most entries are zero (see the sketch below)
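Sparse vectors are typically stored as term-to-count mappings rather than dense arrays. A minimal sketch, scoring a one-term query by raw term frequency (one of the simple alternatives mentioned above); the document vector is hypothetical:

```python
# Sparse document vector: only nonzero coordinates are stored.
doc = {"quick": 1, "brown": 1, "fox": 2, "lazy": 1}

def score_one_term(query_term, doc_vector):
    """Raw term-frequency score: 0 if the term is absent."""
    return doc_vector.get(query_term, 0)

print(score_one_term("fox", doc))    # 2
print(score_one_term("party", doc))  # 0
```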
Why distance is a bad idea

The Euclidean distance between q and d2 can be large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance

● Thought experiment: take a document d and append it to itself; call this document d′
  ● “Semantically” d and d′ have the same content
  ● The Euclidean distance between the two documents can be quite large
  ● The angle between the two documents is 0, corresponding to maximal similarity
● Key idea: measure the similarity between documents as the angle between their vector representations (a numerical check follows below)
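A quick numerical check of the thought experiment in Python; the term-count vector is hypothetical:

```python
import numpy as np

d = np.array([3.0, 1.0, 0.0, 2.0])  # hypothetical term-count vector
d_prime = 2 * d                      # d appended to itself: every count doubles

euclidean = np.linalg.norm(d - d_prime)
cosine = d @ d_prime / (np.linalg.norm(d) * np.linalg.norm(d_prime))

print(euclidean)  # large: equals ||d|| = 3.74...
print(cosine)     # 1.0: angle 0, maximal similarity
```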
Vector Space Model
Postulate: Documents that are “close together” in vector space “talk about” the same things
[Figure: documents d1 … d5 and query Q as vectors in a space with term axes t1, t2, t3; θ and φ are angles between the query vector and document vectors.]

Therefore, retrieve documents based on how close the document is to the query (e.g., similarity ~ cosine of the angle)
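A minimal cosine-ranking sketch in Python; the term-count vectors over axes (t1, t2, t3) are hypothetical:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical count vectors over the terms (t1, t2, t3).
docs = {
    "d1": np.array([2.0, 0.0, 1.0]),
    "d2": np.array([0.0, 3.0, 1.0]),
    "d3": np.array([1.0, 1.0, 0.0]),
}
q = np.array([1.0, 0.0, 1.0])

# Rank documents by cosine similarity to the query.
for name, d in sorted(docs.items(), key=lambda kv: -cos_sim(q, kv[1])):
    print(name, round(cos_sim(q, d), 3))  # d1 first, then d3, then d2
```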
How do we weight doc terms in the vectors?
● Here’s the intuition:
  ● Terms that appear often in a document should get high weights
    – The more often a document contains the term “dog”, the more likely the document is “about” dogs
  ● Terms that appear in many documents should get low weights
    – Words like “the”, “a”, “of” appear in (nearly) all documents
● How do we capture this mathematically?
  ● Term frequency
  ● Inverse document frequency
TF-IDF
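A common way to combine the two quantities above (a sketch, since the slides stop at the heading) is to multiply term frequency by the log of the inverse document frequency, w(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. In Python, with a hypothetical collection:

```python
import math

# Hypothetical collection of tokenized documents.
docs = [
    ["quick", "brown", "fox"],
    ["lazy", "brown", "dog"],
    ["quick", "quick", "fox"],
]
N = len(docs)

def tf_idf(term, doc):
    """tf-idf weight: term frequency times log inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(N / df) if df else 0.0

print(tf_idf("quick", docs[2]))  # higher: frequent in doc, in only 2 of 3 docs
print(tf_idf("brown", docs[0]))  # lower: tf is 1, also in 2 of 3 docs
```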