Learning on the Test Data: Leveraging “Unseen” Features


Ben Taskar, Ming Fai Wong, Daphne Koller

Introduction

• Most statistical learning models make the assumption that data instances are IID samples from some fixed distribution.

• In many cases, the data are collected from different sources, at different times, locations and under different circumstances.

• We usually build a statistical model of features under the assumption that future data will exhibit the same regularities as the training data.

• In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data.

Examples

1. Classifying news articles chronologically: New events, people and places appear (and disappear) in bursts over time. The training data might consist of articles taken over some time period; these are only somewhat representative of future articles. The test data may contain some features that are not observed in the training data.

2. Classifying customers into categories: Our training data might be collected from one geographical region, which may not represent the distribution in other regions.

We can sidestep this difficulty by pooling all the examples and selecting the training and test sets at random.

But this homogeneity cannot be ensured in real-world tasks, where only the non-representative training data is actually available for training.

The test data may contain many features that were never or only rarely observed in training data. These features may be used for classification.

For example, in the news article task these local features might include the names of places or people currently in the news.

In the customer example, these local features might include purchases of products that are specific to a region.

Scoped Learning

Suppose we want to classify news articles chronologically. The phrase “XXX said today” might appear in many places in the data for different values of “XXX”.

These features are called scope-limited features or local features.

Another example: Suppose there are two labels, grain and trade. Words like corn or wheat often appear in phrases such as “tons of wheat”. So we can learn that if a word appears in the context “tons of xxx”, it is likely to be associated with grain. If we then find a phrase like “tons of rye” in the test data, we can infer that rye has a positive interaction with the label grain.

Scoped learning is a probabilistic framework that combines the traditional IID features with scope-limited features.

The intuitive procedure for using the local features is to use the information from the global (IID) features to infer the rules that govern the local information for a particular subset of data.

When the data exhibits scope, the authors found significant gains in performance over traditional models, which use only IID features.

All the data instances within a particular scope exhibit some structural regularity and we assume that all the future data will exhibit the same structural regularity.

General Framework:

Notion of scope:

• We assume that data instances are sampled from some set of scopes, each of which is associated with some data distribution.

• Different distributions share a probabilistic model for some set of global features, but can contain a different probabilistic model for a scope-specific set of local features.

• These local features may be rarely or never seen in the scopes comprising the training data.

Let X denote global features, Z denote local features, and Y the class variable. For each global feature X_i, there is a parameter γ_i. Additionally, for each scope S and each local feature Z_i, there is a parameter λ_i^S.

Then the distribution of Y given all the features and weights is
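The formula itself is missing from this transcript. A form consistent with the potentials exp(γ_i Y^j X_i^j) and exp(λ_i Y^j Z_i^j) that appear later in these slides would be the following. It is a reconstruction under that assumption, not a quotation, and it omits any bias term the original may have included.

```latex
P(Y = y \mid x, z, \gamma, \lambda^S) \;\propto\;
  \exp\Big( \sum_i \gamma_i \, x_i \, y \;+\; \sum_i \lambda_i^S \, z_i \, y \Big)
```

If the label is assumed to take values in {−1, +1}, normalizing over the two values of y gives a logistic form in the total score.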

Probabilistic model:

• We assume that the global weights can be learned from training data, so their values are fixed when we encounter a new scope. The local feature weights are unknown and are treated as hidden variables in the graphical model.

• Idea:

We use the evidence from global features about the labels of some of the instances to modify our beliefs about the role of the local features present in these instances, so that these beliefs are consistent with the labels. By learning about the roles of these features, we can then propagate this information to improve accuracy on instances that are harder to classify using global features alone.

To implement this idea, we define a joint distribution over λ^S and y^1, ..., y^m.

Why use Markov Random Fields:

Here the associations between the variables are correlational rather than causal. Markov random fields are used to model spatial interactions or interacting features.

Markov Network

• Let V = (V_d, V_c) denote a set of random variables, where V_d are discrete and V_c are continuous variables, respectively.

• A Markov network over V defines a joint distribution over V, assigning a density over V_c for each possible assignment v_d to V_d.

• A Markov network M is an undirected graph whose nodes correspond to V.

• It is parameterized by a set of potential functions φ_1(C_1), ..., φ_l(C_l) such that each C ⊆ V is a fully connected subgraph, or clique, in M, i.e., each pair V_i, V_j ∈ C is connected by an edge in M.

• Here we assume that each φ(C) is a log-quadratic function.

• The Markov network then represents the distribution P(v) = (1/Z) ∏_i φ_i(c_i), where Z is the normalizing constant.

• In our case the log-quadratic model consists of 3 types of potentials

• 1) φ(γ_i, Y^j, X_i^j) = exp(γ_i Y^j X_i^j) relates each global feature X_i^j in instance j to its weight γ_i and the class variable Y^j of the corresponding instance j.

• 2) φ(λ_i, Y^j, Z_i^j) = exp(λ_i Y^j Z_i^j) relates the local feature Z_i^j to its weight λ_i and the label Y^j.

• Finally, as the local feature weights are assumed to be hidden, we introduce a prior over their values, of the form

• Overall, our model specifies a joint distribution as follows:
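The prior and the joint distribution are missing from this transcript. One reconstruction that stays within the log-quadratic family and uses the prior mean µ_i mentioned later in the slides is a Gaussian prior. The shared variance σ² is an assumption introduced here, and the scope index S on λ is suppressed for readability.

```latex
\begin{align*}
\phi(\lambda_i) &= \exp\!\Big( -\frac{(\lambda_i - \mu_i)^2}{2\sigma^2} \Big) \\[4pt]
P(y^1, \dots, y^m, \lambda \mid x, z, \gamma) &\;\propto\;
  \prod_i \phi(\lambda_i)
  \;\prod_{j=1}^{m} \prod_i \exp\big( \gamma_i \, y^j x_i^j \big)
  \;\prod_{j=1}^{m} \prod_i \exp\big( \lambda_i \, y^j z_i^j \big)
\end{align*}
```

Each factor corresponds to one of the three potential types: the prior on a local weight, a global-feature potential, and a local-feature potential.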

Markov network for two instances, two global features and three local features

• The graph can be simplified further when we account for variables whose values are fixed.

• The global feature weights are learned from the training data and hence their value is fixed and we also know all the feature values.

The resulting Markov network is shown below (assuming that the instance (x1, z1, y1) contains the features Z1 and Z2, and the instance (x2, z2, y2) contains the features Z2 and Z3).

[Figure: Markov network over the label nodes Y1, Y2 and the local feature weight nodes λ1, λ2, λ3]

This can be reduced further. When Z_i^j = 0, there is no interaction between Y^j and the variable λ_i. In this case we can simply omit the edge between λ_i and Y^j.

The resulting Markov network is shown below.

[Figure: reduced Markov network in which Y1 is connected to λ1 and λ2, and Y2 is connected to λ2 and λ3]

• In this model, we can see that the labels of all of the instances are correlated with the local feature weights of the features they contain, and thereby with each other. Thus, for example, if we obtain evidence (from global features) about the label Y1, it would change our posterior beliefs about the local feature weight λ2, which in turn would change our beliefs about the label Y2. Thus, by running probabilistic inference over this graphical model, we obtain updated beliefs both about the local feature weights and about the instance labels.
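The propagation described above can be checked on the two-instance example with a small sketch. It is not the inference method used in the slides (they use Expectation Propagation); it enumerates label assignments exactly, which only works for tiny scopes. It assumes labels in {−1, +1}, binary local features, and a Gaussian prior N(mu, sigma²) on each local weight, so that each weight integrates out in closed form. The function name and arguments are hypothetical.

```python
import itertools
import math


def label_marginals(global_scores, local_features, mu=0.0, sigma=1.0):
    """Exact P(Y^j = +1) for a tiny scope, by enumerating label assignments.

    global_scores[j]  : sum_i gamma_i * x_i^j for instance j (fixed after training)
    local_features[j] : set of local-feature ids present (z = 1) in instance j
    Assumption: each local weight has prior N(mu, sigma^2) and is integrated
    out using E[exp(lambda * t)] = exp(mu * t + 0.5 * sigma^2 * t^2).
    """
    m = len(global_scores)
    feature_ids = set().union(*local_features)
    weights = {}
    for labels in itertools.product((-1, 1), repeat=m):
        # Contribution of the (fixed) global potentials.
        log_w = sum(s * y for s, y in zip(global_scores, labels))
        for f in feature_ids:
            # t = sum of the labels of the instances that contain feature f.
            t = sum(y for y, feats in zip(labels, local_features) if f in feats)
            log_w += mu * t + 0.5 * sigma ** 2 * t ** 2
        weights[labels] = math.exp(log_w)
    total = sum(weights.values())
    return [
        sum(w for lab, w in weights.items() if lab[j] == 1) / total
        for j in range(m)
    ]


# Instance 1 contains Z1, Z2; instance 2 contains Z2, Z3 (as in the slides).
features = [{1, 2}, {2, 3}]
print(label_marginals([0.0, 0.0], features))  # no global evidence
print(label_marginals([3.0, 0.0], features))  # strong evidence that Y1 = +1
```

In the second call, the global evidence applies only to the first instance, yet the belief about Y2 moves toward +1 as well, because both instances share the local feature Z2.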

Learning the Model:

Learning Global Feature Weights:

• The global feature weights are learned from the training data using standard logistic regression. Maximum-likelihood (ML) estimation finds the weights γ that maximize the conditional likelihood of the labels given the global features.
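As an illustration of maximum-likelihood estimation for the global weights, here is a minimal gradient-ascent sketch. It assumes labels in {−1, +1} and the model P(y | x) ∝ exp(y γ·x), with no bias term and no regularization. The function name, learning rate and step count are hypothetical choices, not values from the slides.

```python
import math


def fit_global_weights(xs, ys, lr=0.1, steps=500):
    """Gradient ascent on the conditional log-likelihood of logistic regression.

    Under P(y | x) proportional to exp(y * gamma.x) with y in {-1, +1},
    P(y | x) = sigmoid(2 * y * gamma.x).
    """
    n_features = len(xs[0])
    gamma = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for x, y in zip(xs, ys):
            score = sum(g * xi for g, xi in zip(gamma, x))
            p_correct = 1.0 / (1.0 + math.exp(-2.0 * y * score))
            for i, xi in enumerate(x):
                # d/d gamma_i of log sigmoid(2 * y * score)
                grad[i] += 2.0 * y * xi * (1.0 - p_correct)
        gamma = [g + lr * gi / len(xs) for g, gi in zip(gamma, grad)]
    return gamma


# Toy data: feature 0 mostly co-occurs with +1, feature 1 mostly with -1.
xs = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
ys = [1, 1, -1, -1, -1, 1]
gamma = fit_global_weights(xs, ys)
print(gamma)
```

The learned weight is positive for the first feature and negative for the second, matching the label frequencies in the toy data.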

Learning Local feature Distributions:

• Local features often follow patterns that carry over across scopes (for example, words appearing in the context “tons of xxx”). We can exploit such patterns by learning a model that predicts the prior of the local feature weights using meta-features, that is, features of features. More precisely, we learn a model that predicts the prior mean µ_i for λ_i from some set of meta-features m_i. As our predictive model for the mean µ_i we use a linear regression model, setting

µ_i = w · m_i.
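The linear model for the prior mean can be sketched as follows. The meta-feature layout (a constant term plus an indicator for "appears in the context 'tons of xxx'"), the target weights, and the least-squares fit by gradient descent are all assumptions made for illustration; the slides state only that a linear regression model is used.

```python
def prior_mean(w, meta_features):
    """Prior mean mu_i = w . m_i for one local feature."""
    return sum(wk * mk for wk, mk in zip(w, meta_features))


def fit_meta_weights(meta, targets, lr=0.1, steps=2000):
    """Least-squares fit of w by gradient descent (hypothetical training setup).

    meta[i]    : meta-feature vector m_i of a feature whose weight is known
    targets[i] : that feature's weight, used as the regression target
    """
    w = [0.0] * len(meta[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for m, t in zip(meta, targets):
            err = prior_mean(w, m) - t
            for k, mk in enumerate(m):
                grad[k] += err * mk
        w = [wk - lr * gk / len(meta) for wk, gk in zip(w, grad)]
    return w


# m_i = [constant, occurs in the context "tons of xxx"] (hypothetical layout)
meta = [[1, 1], [1, 1], [1, 0], [1, 0]]
targets = [1.2, 0.8, -0.1, 0.1]  # made-up weights toward the label grain
w = fit_meta_weights(meta, targets)
print(prior_mean(w, [1, 1]))  # prior mean for an unseen word such as "rye"
```

A word never seen in training still receives an informative prior mean, because its meta-features are observable in the test data.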

Using the model

• Step 1: Given a training set, we first learn the model. In the training set, the local and global features are treated identically. When applying the model to the test set, however, our first decision is to determine the set of local and global features.

• Step 2: Our next step is to generate the Markov network for the test set.

Probabilistic inference over this model infers the effect of local features.

• Step 3: We use Expectation Propagation for inference. It maintains approximate beliefs (marginals) over nodes of the Markov network and iteratively adjusts them to achieve local consistency.
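Steps 1 and 2 can be sketched as below. The split rule is an assumption: a test-set feature counts as global if it occurs in at least `min_train_count` training documents, and as local otherwise. The slides do not give the actual criterion, and the function names are hypothetical. The edge list follows the rule stated earlier that an edge between λ_i and Y^j exists only when feature i is present in instance j.

```python
from collections import Counter


def split_features(train_docs, test_docs, min_train_count=2):
    """Partition the features observed in the test set into global and local.

    Assumption: frequency in the training set decides the split.
    """
    train_counts = Counter(f for doc in train_docs for f in set(doc))
    test_features = {f for doc in test_docs for f in doc}
    global_feats = {f for f in test_features if train_counts[f] >= min_train_count}
    local_feats = test_features - global_feats
    return global_feats, local_feats


def network_edges(test_docs, local_feats):
    """Edges (instance index, local feature) of the reduced Markov network."""
    return {(j, f) for j, doc in enumerate(test_docs) for f in doc if f in local_feats}


train = [["tons", "wheat"], ["tons", "corn"], ["trade", "tariff"], ["trade", "tons"]]
test = [["tons", "rye"], ["rye", "harvest"]]
global_feats, local_feats = split_features(train, test)
print(global_feats, local_feats)
print(network_edges(test, local_feats))
```

Here "rye" links the two test instances, so evidence about the label of the first can influence the second during inference.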

Experimental Results:

• Reuters:

• The Reuters news articles data set contains a substantial number of documents hand-labeled into the categories grain, crude, trade, and money-fx.

• Using this data set, six experimental setups are created, by using all possible pairings of categories from the four categories chosen.

• The documents, ordered in time, are divided into nine time segments with roughly the same number of documents in each segment.

• WebKB2:

• This data set consists of hand-labeled web pages from the Computer Science department web sites of four schools: Berkeley, CMU, MIT and Stanford. The pages are categorized into faculty, student, course and organization.

• Six experimental setups are created by using all possible pairings of categories from the four categories.
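The experimental setups above can be enumerated with a short sketch: four categories give six unordered pairs, and a time-ordered document list can be cut into nine contiguous segments of nearly equal size. The segmentation helper is hypothetical and assumes the documents are already sorted chronologically.

```python
from itertools import combinations

reuters_categories = ["grain", "crude", "trade", "money-fx"]
webkb_categories = ["faculty", "student", "course", "organization"]

# All unordered pairs of categories: C(4, 2) = 6 binary classification setups.
reuters_setups = list(combinations(reuters_categories, 2))
webkb_setups = list(combinations(webkb_categories, 2))


def time_segments(docs_in_time_order, n_segments=9):
    """Split a chronologically ordered list into contiguous, near-equal segments."""
    n = len(docs_in_time_order)
    bounds = [k * n // n_segments for k in range(n_segments + 1)]
    return [docs_in_time_order[bounds[k]:bounds[k + 1]] for k in range(n_segments)]


print(len(reuters_setups), len(webkb_setups))
print([len(seg) for seg in time_segments(list(range(100)))])
```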
