Thinking in n-spaces: Section 4 of Linguistics in the context of the Cognitive and Computational Sciences

Page 1:

Thinking in n-spaces

Section 4 of Linguistics in the context of the Cognitive and Computational Sciences

Page 2:

When you need to manipulate n different numbers, you're living in n-space.

Page 3:

Vectors

• A vector can be thought of in 3 ways:

• 1. It is a set of numbers ordered in a particular way, which by convention are written inside parentheses and separated by commas:

• (1, 4, 0)
• (5)
• (-.5, 0, .1) etc.

Page 4:

The space a vector lives in ...

• A vector with n components lives in a particular space with lots of other vectors. A space has a size which we call its dimensionality. There is 1-space, which is a line, 2-space (a plane), 3-space (familiar space), 4-space, …n-space.

• If you can imagine 3-space, then that’s good enough for n-space.

Page 5:

1-space

[Figure: a number line with ticks at -3, -2, -1, 0, 1, 2, 3.]

Page 6:

2-space

[Figure: the vectors (2,1) and (4,2) plotted in the plane.]

Page 7:

(1,2) + (3,1) = (4,3)

Page 8:

3-space: (30, 10, 60)

Page 9:

• When you need to manipulate n different numbers, you're living in n-space.

• Things are more complex, and our usual intuitions must be modified for this new world.

(Later we'll talk about reducing the dimensionality of a space while keeping its points: changing the basis of the space.)

Page 10:

Vectors: 2nd view

• A vector with n components (an n-vector) can be thought of as a particular point in a space of n dimensions. To make it particularly visible, you can think of a line connecting the origin (the zero point of the space) to that point.

• In this sense, the point exists independently of the coordinate system. That’s hardly the case for the first way of thinking of a vector.

Page 11:

Vectors: 3rd view

• Thus we can think of a vector as a length and a direction. In that way we don’t have to think of the vector as actually starting from the origin and ending up at some point. The vector can be translated (=moved) anywhere we like. Writing a vector in coordinates doesn’t emphasize this, but it’s true anyway.

Page 12:

Length of a vector

• How long is a vector? This is linked to (and usually discussed after) the question of comparing two vectors; but let's just point out for now that the most common way of defining the length of a vector is by using the Pythagorean theorem...

Page 13:

Length of a hypotenuse...

[Figure: a right triangle with legs x and y.]

Length = sqrt( x² + y² )

Page 14:

Length….

• You square each of the coordinates, and add up all those squares;

• then take the square root of that sum.

Page 15:

Negative coordinates

• Remember, a coordinate can be negative, but a negative coordinate contributes to length just as much as a positive one: (2, -5) is exactly as long as (2, 5). Squaring takes care of that.

Page 16:

N-dimensions...

Length of a vector is the square root of the sum of the squares of all the coordinates:

length = sqrt( Σᵢ xᵢ² )

that is, ( Σᵢ xᵢ² )^(1/2)
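As a minimal Python sketch (the function name is ours, just for illustration):

```python
import math

def length(v):
    """Euclidean length: square root of the sum of the squared coordinates."""
    return math.sqrt(sum(x * x for x in v))

print(length([3, 4]))    # 5.0
print(length([2, -5]))   # 5.385..., exactly as long as (2, 5)
```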

Page 17:

Sometimes we care a lot about length; sometimes we don't care at all about length

If we don't care about length, and only want to compare the directions that a set of vectors point in, then we normalize the vectors. This means:

We divide each of them by their length:

Which means, we make them all land on a hypersphere of radius 1.0. Got that?

Page 18:

Normalizing vectors.

• Remember that dividing a vector by a number (that number is its length) means dividing each coordinate by that number.

• So half of (12,6,4) is (6,3,2): it’s a vector in the same direction, just half as long.
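A minimal sketch:

```python
import math

def normalize(v):
    """Divide each coordinate by the vector's length: same direction, length 1."""
    n = math.sqrt(sum(x * x for x in v))   # the vector's length
    return [x / n for x in v]

print(normalize([12, 6, 4]))   # [6/7, 3/7, 2/7]: length 14 shrinks to 1
```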

Page 19:

How close are two vectors?

You can always tell how similar (= close) two vectors are. The most common way is by taking the inner product (a.k.a. dot product):

V . W = Σᵢ v[i] * w[i]

This is also equal to: the length of V * the length of W * the cosine of the angle between them.

Page 20:

Distance between 2 vectors

Or you can find the cosine of the angle between the two vectors:

inner product of A and B
= sum of the products over each dimension
= length of A * length of B * cosine of the angle between them.

So cos(a) = A . B / ( |A| |B| )
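A minimal sketch, using |V| = sqrt(V . V) for the lengths:

```python
import math

def dot(v, w):
    """Inner (dot) product: sum of products of corresponding coordinates."""
    return sum(x * y for x, y in zip(v, w))

def cosine(v, w):
    """Cosine of the angle between v and w: (V . W) / (|V| |W|)."""
    return dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))

print(cosine([1, 0], [0, 1]))   # 0.0: a right angle
print(cosine([1, 2], [2, 4]))   # 1.0 (up to rounding): same direction
```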

Page 21:

• Remember: the cosine of an angle goes like this: cos (0) is 1.0, but then it gets smaller, to the point where if the angle is 90 degrees -- a right angle -- the cosine is 0.0.

• Two vectors being at right angles is a very important condition: we say that they are orthogonal. They have nothing to do with each other.

Page 22:

The length of a vector (bis)

The length of a vector V is the square root of the inner product of V with itself:

|V| = sqrt(V . V)

Page 23:

Projection of 1 vector onto another

• Projection of A onto B is (just) (A.B)/|B| -- which is just A.B if B is normalized.

[Figure: vectors A and B, with the projection of A onto B marked along B.]
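A minimal sketch of the projection formula (function names are ours):

```python
import math

def dot(v, w):
    return sum(x * y for x, y in zip(v, w))

def projection_length(a, b):
    """(A . B) / |B|: how much of A lies along B's direction."""
    return dot(a, b) / math.sqrt(dot(b, b))

print(projection_length([3, 4], [1, 0]))   # 3.0: the x-component of (3, 4)
```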

Page 24:

Distance between two vectors

Another way to measure how similar or different two vectors A, B are is to measure the length of the line that connects them:

that line can be thought of as A - B, so its coordinates are (a₁ - b₁, a₂ - b₂, …), and its length must be

sqrt( Σᵢ (aᵢ - bᵢ)² )
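And a matching sketch for distance:

```python
import math

def distance(a, b):
    """Length of the vector A - B."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance([1, 2], [4, 6]))   # 5.0
```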

Page 25:

Addition of 2 vectors...

• Just add the corresponding components.

Page 26:

2 vectors in the same space...

• …have the same number of coordinates, and can be added or dot-product-ed. You can't do either operation to pairs of vectors not living in the same space.

Page 27:

Cross product of 2 vectors (tensor product)

• You can take the cross product of two vectors not living in the same space. You can take a cross product of a vector in m-space and one in n-space. This makes a matrix (we’ll get to matrices next time) of size m x n. That’s what we did in setting up the weight space of a network.
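What the slide calls the cross (tensor) product corresponds to the outer product; a minimal NumPy sketch:

```python
import numpy as np

v = np.array([1, 2, 3])    # lives in 3-space
w = np.array([10, 20])     # lives in 2-space
M = np.outer(v, w)         # a 3 x 2 matrix: M[i, j] = v[i] * w[j]
print(M)
# [[10 20]
#  [20 40]
#  [30 60]]
```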

Page 28:

Changing the origin of the space

• Sometimes we want to change what counts as the zero-point of the vector space, but in some sense keep the vectors fixed.

• The most common reason to want to do this is to make the origin be in the middle of the data --

• That is, if we have a bunch of data points, we want to make the zero value on each dimension be the average of all of the values on that dimension -- so that the new average value really is zero….

Page 29:

• A set of N vectors { xⱼ } = { (xⱼ₁, xⱼ₂, xⱼ₃, …) }

• Take the average over the j's:

• (1/N) Σⱼ xⱼ₁ -- that's the first coordinate of the new origin.

• In most linguistic cases (all?), the raw scores are semi-positive (non-negative), so the average values are positive. But when we shift the origin to the mean value, roughly half of the new coordinates are negative: that’s what we want.
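A minimal sketch of this mean-centering on made-up data:

```python
import numpy as np

# Rows are data points; columns are dimensions. All raw scores are non-negative.
data = np.array([[3.0, 1.0],
                 [5.0, 3.0],
                 [4.0, 2.0]])
centered = data - data.mean(axis=0)   # shift the origin to the per-dimension mean
print(centered)                       # roughly half the new coordinates are negative
print(centered.mean(axis=0))          # the new averages really are zero
```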

Page 30:

1st link between neural nets and vectors….

Consider a linear associator -- N input nodes, and just one output unit O, let’s say for simplicity’s sake.

The input to the N input nodes can be viewed as an N-vector (once we’ve numbered those nodes); call it Input. The activation coming into unit O is the inner (dot) product of two vectors...

Page 31:

• The input vector Input…and

• a vector whose coordinates are the weights connecting each input unit to the output unit.

• Yes! Typically we want those values to add up to 1.0: so the set of weights connecting to each output unit is a vector C, living in a space of N dimensions, the same space in which the input vectors live.

• Each output unit can, and should, be visualized as a vector in the input space -- typically a vector of length 1 (normalized)….

Page 32:

• And the input to that output unit is the inner product of its vector, and the input vector. In short, if the two are close (similar), the output unit gets lots of activation; if they’re far apart, the output unit gets little.

Page 33:

Again...

• So think of each output unit as a reporting station telling us how much of the input vector is projected onto its vector in the input space.
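Putting the last few slides together in a sketch (the weight and input vectors are made up):

```python
def dot(v, w):
    return sum(x * y for x, y in zip(v, w))

# One output unit of a linear associator: its normalized weight vector.
w = [1/3, 2/3, 2/3]            # a unit-length vector in the input space
similar = [2.0, 4.0, 4.0]      # input pointing the same way as w
orthogonal = [4.0, -2.0, 0.0]  # input at right angles to w

print(dot(w, similar))      # 6.0 (up to rounding): lots of activation
print(dot(w, orthogonal))   # 0.0: none at all
```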

Page 34:

Some linguistic examples

Words as vectors in 26-dimensional space:

cab maps to (1,1,1,0,0,….).

madam maps to (2, 0, 0, 1, 0, …, 2, …, 0): two a's, one d, and two m's, with the 2 for m at dimension 13.
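A minimal sketch of this mapping (the function name is ours):

```python
from collections import Counter
from string import ascii_lowercase

def letter_vector(word):
    """Map a word to its letter-count vector in 26-dimensional space."""
    counts = Counter(word.lower())
    return [counts[c] for c in ascii_lowercase]

print(letter_vector("cab")[:5])     # [1, 1, 1, 0, 0]
print(letter_vector("madam")[:5])   # [2, 0, 0, 1, 0]
```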

What do normal distance metrics tell us about this representation?

Page 35:

• They tell us something about similarity, abstracting away from linearity and distance between letters. We can identify a sequence abc regardless of where it appears in a word (though we’ll confuse it with the same string scattered throughout the word).

Question: what’s the relative price of these two things: excluding all words that fail to have an a, a b, and a c; and…

including extra words that have an a, a b, and a c, but not in that order.

One answer: if we use only 1’s and 0’s for our counters, we can do this very fast on traditional bit-based computers.
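A sketch of that bit trick: represent each word as a 26-bit mask of which letters occur, so the a-b-c test is a single AND (the function name is ours):

```python
def letter_mask(word):
    """A 26-bit mask: bit i is set iff letter i occurs in the word."""
    mask = 0
    for ch in word.lower():
        mask |= 1 << (ord(ch) - ord("a"))
    return mask

abc = letter_mask("abc")
print(letter_mask("cab") & abc == abc)     # True: has an a, a b, and a c
print(letter_mask("bacon") & abc == abc)   # True, though not in that order
print(letter_mask("bad") & abc == abc)     # False: no c
```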

Page 36:

• Computers can do bit-based arithmetic very fast.

Page 37:

Morphology

• Suppose we consider a simplified language, where there are a large number of stems, and to each stem we assign (observationally) a signature, which summarizes the information about what suffixes appear on that stem in the corpus. Using (non-vector) notation, we say that there are stems with the signature ed.ing, and others with s.'s. Using vectors, we describe those as (1,1,0,0) and (0,0,1,1) respectively, for a stem that appears one time with the relevant suffix….

Page 38:

• Suppose we have various stems with such signature vectors as:

• jump (3,3,0,0)

• kick (5,5,0,0)

• boy (0,0,2,2)

• girl (0,0,2,2)

• Obviously these are very artificial; even given the artifice of the example, having the same number of each suffix isn’t necessary. We’ll change that later.

Page 39:

(Of) How many dimensions is this space?

• The data demand that they be analyzed in a 4-dimensional space; but the analysis will show us that only a 1-dimensional space is necessary.

• Analysis can be identified with a reduction of the dimensionality of the data space.

• That’s the most important point of this course.

Page 40:

From 2 dimensions to 1 dimension

[Figure: axes labeled ing and 's. Some stems have lots of 's but no ing, e.g. (0,3) and (0,5); others have lots of ing but no 's, e.g. (2,0) and (6,0). A new axis runs through the two clusters.]

Page 41:

Analysis….

• What we would like to have is a single dimension along which a positive integer would tell us how many ings were observed; a negative value would tell us how many ‘s were observed; and there would be no way to express any kind of arrangement of the sort we don’t find.

Page 42:

New axis

• The new axis could be a line that runs through (0,0) but has a slope of -1: so the vectors (1,-1) and (-1,1) lie on it.

[Figure: the ing and 's axes with the data points and the new axis drawn through them.]

Page 43:

New axis

• What is this new axis? It is the “noun-verb” (better, Category 1/Category 2) distinction.

• Is it clear that we could do exactly the same thing with our original 4-dimensional data:

• jump (3,3,0,0) becomes (3)
• kick (5,5,0,0) becomes (5)
• boy (0,0,2,2) becomes (-2)
• girl (0,0,2,2) becomes (-2)
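One way to read that mapping: project each signature vector onto a new axis direction. The axis (1, 1, -1, -1)/2 below is our choice of direction and scaling, picked to reproduce the slide's numbers:

```python
axis = [0.5, 0.5, -0.5, -0.5]   # the hypothesized Category 1/Category 2 axis

signatures = {"jump": [3, 3, 0, 0],
              "kick": [5, 5, 0, 0],
              "boy":  [0, 0, 2, 2],
              "girl": [0, 0, 2, 2]}

for stem, sig in signatures.items():
    coord = sum(a * s for a, s in zip(axis, sig))  # projection onto the new axis
    print(stem, coord)   # jump 3.0, kick 5.0, boy -2.0, girl -2.0
```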

Page 44:

Rotation of coordinates (i.e., of basis)

• We just rotated the basis, and found that we then could dispense with the second dimension. It’s always easy to rotate a coordinate system. Consider the 2-dimensional case, in a plane.

Page 45:

Rotation

[Figure: a point with coordinates (x, y); the axes are rotated through an angle θ, and in the rotated system the point's new coordinates are (x′, y′). Two similar triangles decompose x and y into cos θ and sin θ components along the new axes.]

Page 46:

Rotation

So the new x′ value is:

• x′ = x cos θ + y sin θ

Doing much the same thing on the y-axis:

• y′ = -x sin θ + y cos θ

Page 47:

Next time we’ll rewrite this:

x′ = x cos θ + y sin θ
y′ = -x sin θ + y cos θ

( x′ )   (  cos θ   sin θ ) ( x )
( y′ ) = ( -sin θ   cos θ ) ( y )
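A minimal sketch of that rotation in code:

```python
import numpy as np

theta = np.pi / 4                       # rotate the axes by 45 degrees
R = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

print(R @ np.array([1.0, 1.0]))   # [1.414..., ~0.0]: x' absorbs everything
```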

Page 48:

Represent words as bigrams

• How big is the space of bigrams? 27² = 729, since we virtually always care about marking the ends of words (26 letters plus a word-boundary symbol).

• Thus each word can be represented in the space of bigrams: this is a very sparse representation: the average number of non-zeros per word is very low.
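A sketch of such a sparse representation, keeping only the non-zero dimensions (the boundary symbol # and the function name are our choices); this also anticipates the point two slides below about not storing all the zeros:

```python
from collections import Counter

def bigram_vector(word, boundary="#"):
    """Sparse bigram counts, with word boundaries marked."""
    padded = boundary + word + boundary
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

print(bigram_vector("cab"))   # Counter({'#c': 1, 'ca': 1, 'ab': 1, 'b#': 1})
```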

Page 49:

• But if we want to measure similarity between two words, and if we want to do lots of measuring of pairs of words (remember, if there are 1,000 words, then there are a million pairs!), then it may be best to set up a bigram representation for each word and do a vector comparison.

Page 50:

• In practical computational terms, when the vectors are very sparse, we don’t keep track of all the zeros -- just keep track of which dimensions have non-zero values, and what those values are.

Page 51:

Syntax

• We can define a category in the following way.

• If there are N words in the language, then we define a space with N-dimensions, one for each word. Let’s define a raw category as a vector in that space with integral values, all 0 or greater than 0 (these are counts). The count of a raw category is the sum of its coordinates. A (regular) category is a raw category divided by its total count:

Page 52:

Raw, regular category:

{the 3, my 2, these 3, …}

(3, 2, 3, …) raw category

total count = 73.

Regular category:

(3/73, 2/73, 3/73, …)

The coordinates of a regular category form a distribution: a set of numbers that add up to 1.0.
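As a minimal sketch:

```python
def to_distribution(raw):
    """Divide a raw category by its total count; the result sums to 1.0."""
    total = sum(raw)
    return [x / total for x in raw]

print(to_distribution([3, 2, 3]))   # [0.375, 0.25, 0.375]
```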

Page 53:

Distribution...

• Remember, a distribution is a set of numbers that add up to 1.0.

• The vectors whose coordinates are distributions all lie on a plane (hyperplane) in their n-space.

• Similar to, but different from, the points which lie on a (hyper) sphere in that space -- different definition of length.

Page 54:

Category

• The set of left-neighbors of a word forms a category.

• Likewise the right-neighbors of a word.

• In any particular corpus, there won't be any perfect identities between the left-hand neighbors and the right-hand neighbors of any word. Remember: identity means agreement on all coordinates.

Page 55:

• A space can be set up in the first place using coordinates based on (low-level) observations (like categories were). That has a dimensionality based on the number of words in the language.

• We will try to lower the dimensionality of the space by finding new ways to express vectors (= categories) with a smaller number of coordinates.

• Virtually all analysis is of that sort: find a small number of dimensions which suffice to map data that originally lived in a space of much higher dimension.

Page 56:

Brief discussion of correlation and covariance

• When you have a set of measurements (e.g., counts of words in various contexts), the best simple guess of what you’ll find in the next context you look at is just the average frequency so far: just guess the average. With no further information, nothing could be a better guess.

Page 57:

Matrices

• To talk about correlation and covariance, we need to have an idea about matrices.

• A matrix is a two-dimensional array of numbers. It can be thought of as a collection of rows, lined up on top of each other; a collection of columns, one next to another; a linear neural network….

Page 58:

Matrix multiplication…M times vector
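A minimal sketch of a matrix times a vector (made-up numbers):

```python
import numpy as np

M = np.array([[1, 2],
              [3, 4]])
v = np.array([1, 1])
print(M @ v)   # [3 7]: each output is the dot product of one row of M with v
```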

Page 59:

Correlation matrix

• Wᵢⱼ = average of iᵢ * iⱼ = (1/N) Σ over all patterns of iᵢ * iⱼ

• Obviously (right?) this is symmetric, which means that Wᵢⱼ = Wⱼᵢ

Page 60:

• Our next task is to figure out how deviations from the average are justified in particular contexts -- that’s further knowledge.

• Suppose you are allowed to describe a context with just one number. (You have one dimension you’re allowed to describe, using it…). The rational choice is to use the dimension along which the variation is the greatest (deviation, on average, from the mean). Why?

Page 61:

• Clearly, if there’s a dimension which always shows up with its mean value, there’s no point in using your information to describe that. No, you use your information to describe that dimension along which variation is the greatest.

• That dimension is the (first) principal component. Component of what? (of the covariance.)...

Page 62:

• The correlation matrix is what you get by measuring the connection between the individual values of a vector, when we sample a large number of vectors from a set.

• aᵢⱼ measures the correlation between values aᵢ and aⱼ: it's the product of those values, averaged over the set of vectors.

• The covariance matrix is what you get when you measure the correlation between the deviations from each dimension's mean:

• Covᵢⱼ = (1/N) Σ (xᵢ - meanᵢ)(xⱼ - meanⱼ)
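A minimal sketch with made-up data:

```python
import numpy as np

# Rows are sampled vectors; columns are dimensions.
data = np.array([[1.0, 2.0],
                 [2.0, 4.1],
                 [3.0, 5.9]])
centered = data - data.mean(axis=0)
cov = centered.T @ centered / len(data)   # (1/N) sum of deviation products
print(cov)   # symmetric; a large off-diagonal entry means correlated dimensions
```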

Page 63:

• If two variables are unrelated, independent, then their correlation is 0.

• If there’s a correlation between two variables, then there’s more variation along a dimension that is a (linear) combination of the two of them. If you rotate dimensions some, you’ll get more variation along one of the new dimensions that mixes two or more of the old ones. That’s a good deal; take it; rotate.

Page 64:

Brief overview...

It may seem hard to visualize a covariance matrix, but there’s a way.

Since it's symmetric, a covariance matrix in n-space (coordinates x₁, x₂, x₃, …) has only n(n+1)/2 independent entries: n(n-1)/2 off-diagonal terms plus the n diagonal ones.

You can make an identification of a symmetric matrix with a quadratic form, which is

Page 65:

just a sum of a bunch of terms, where each term has a coefficient (i.e., a fixed number) and exactly two variables (from our set of n coordinates).

For example, 2x₁² + 3x₁x₂.

And visually...

Page 66:

If we set one of these quadratic forms equal to a constant….

If there are no "cross-terms" and all coefficients are the same, you get a circle/sphere/hypersphere:

x₁² + x₂² = 1.

If the coefficients aren't the same, you get an ellipse:

x₁² + 2x₂² = 1,

which is longer in the x₁ direction.

Page 67:

• x₁² + 4x₂² = 1

Page 68:

• If you do have cross terms, then you still have an ellipse, but its axes are rotated:

Page 69:

[Figure: the unit circle and a rotated ellipse. The points (1,0) and (-1,0) are marked on the axes; the point (.707, .707) lies on the circle; the rotated ellipse reaches out past it to (.79, .79).]

Page 70:

• So adding cross-terms gives you an ellipse (ellipsoid) where there’s an axis longer than any of the intersections with the axes.

• So to find greater variance, you’re best off rotating your axes -- to the axes of the rotated ellipse.

• These axes are the eigenvectors of the (correlation) matrix we were looking at.

Page 71:

Eigenvector

• For a matrix M, an eigenvector is a vector whose direction is left unchanged by the matrix (though its length may be changed). (Think of this as a memory in a Hopfield net, for example.)

Page 72:

Principal components (PCA)

• The first principal component is the (new) dimension along which the greatest variation is to be found.

• After that, you can find a second principal component, third, and so on.

• How? One way is by making a neural network.

Page 73:

Oja proposal

• Learning rule -- for a single output unit:

Δwᵢ = η O (xᵢ - O wᵢ), where O is the output, η is just a learning rate, and xᵢ is the input.

• This unit will learn so that its "input vector" -- its set of weights, considered as a vector in the input space -- is an eigenvector (or characteristic vector) of the net: a vector whose direction is left unchanged by the net. That's the principal component. The magnification scale (= eigenvalue) is the largest eigenvalue of the matrix.
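A minimal sketch of Oja's rule on made-up two-dimensional data (the learning rate, the data, and the seed are all our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Data whose greatest variation runs along the (1, 1) direction.
data = rng.normal(size=(5000, 2)) @ np.array([[2.0, 1.9],
                                              [1.9, 2.0]])

w = rng.normal(size=2)              # random initial weights
eta = 0.001                         # learning rate
for x in data:
    O = w @ x                       # the unit's output
    w += eta * O * (x - O * w)      # Oja's rule
print(w / np.linalg.norm(w))        # near +/-(0.707, 0.707): the first PC
```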

Page 74:

Linear (in)dependence

For vectors, remember: direction = quality, length = quantity, so to speak.

If two vectors are co-linear, they express the same quality. We’re looking for the best set of vectors to serve as our coordinates in the new vector space. We don’t want two...

Page 75:

Vectors pointing in the same direction: they are redundant. Just pick one.

Once we have picked two vectors to be part of the new coordinate system (=basis), we don’t want to pick any vector which is a combination of the first two...

Page 76:

• Given vectors V₁ and V₂, any combination of them is aV₁ + bV₂, where a and b are any two numbers. By varying a and b, we can reach any point on the plane that V₁ and V₂ sweep out. Any vector on that plane is said to be a linear combination of V₁ and V₂. When we go for a new coordinate system,

Page 77:

We want a set of linearly independent vectors.

In practical terms, that means this. Suppose you have a first vector V. That could be a description of verb-like behavior. Then you want to add another vector W (perhaps one with a lot of another category in it).

Find what W’s projection on to V is, and then subtract that from W. What’s left is independent of V.
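A closing sketch of that recipe (this is one step of Gram-Schmidt orthogonalization; the function names are ours):

```python
def dot(v, w):
    return sum(x * y for x, y in zip(v, w))

def remove_projection(w, v):
    """Subtract from W its projection onto V; what's left is orthogonal to V."""
    coeff = dot(w, v) / dot(v, v)
    return [wi - coeff * vi for wi, vi in zip(w, v)]

w_new = remove_projection([2, 3], [1, 0])
print(w_new)                # [0.0, 3.0]
print(dot(w_new, [1, 0]))   # 0.0: independent of V
```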