Testing the Significance of Attribute Interactions

Aleks Jakulin & Ivan Bratko

Faculty of Computer and Information Science

University of Ljubljana

Slovenia

Overview

1. Interactions:

• The key to understanding many peculiarities in machine learning.

• Feature importance measures the 2-way interaction between an attribute and the label, but there are interactions of higher orders.

2. An information-theoretic view of interactions:

• Information theory provides a simple “algebra” of interactions, based on summing and subtracting entropy terms (e.g., mutual information).

3. Part-to-whole approximations:

• An interaction is an irreducible dependence. Information-theoretic expressions are model comparisons!

4. Significance testing:

• As with all model comparisons, we can investigate the significance of the model difference.

Example 1: Feature Subset Selection with NBC

The calibration of the classifier (expected likelihood of an instance’s label) first improves then deteriorates as we add attributes. The optimal number is ~8 attributes. The first few attributes are important, the rest is noise?

Example 1: Feature Subset Selection with NBC

NO! We sorted the attributes from the worst to the best. It is some of the best attributes that degrade the performance! Why?

Example 2: Spiral/XOR/Parity Problems

Each attribute (x or y) is irrelevant on its own. Together, they make a perfect blue/red classifier.
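The XOR case can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy data set of the four equally likely XOR cases and base-2 logarithms; the helper names are illustrative and not taken from the slides.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution of `samples`."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(XY), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical toy data: the four XOR cases, each equally likely.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
label = [a ^ b for a, b in zip(x, y)]

mi_x = mutual_information(x, label)                 # x alone
mi_y = mutual_information(y, label)                 # y alone
mi_xy = mutual_information(list(zip(x, y)), label)  # x and y treated jointly
print(mi_x, mi_y, mi_xy)
```

Each attribute alone shares no information with the label, while the pair (x, y) determines it completely.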

What is going on? Interactions

[Diagram: the label C overlapping the attributes A and B.]

• 2-Way Interactions: importance of attribute A (A with C), importance of attribute B (B with C), attribute correlation (A with B).

• 3-Way Interaction: What is common to A, B and C together and cannot be inferred from any subset of attributes.

Quantification: Shannon’s Entropy

• H(C): Entropy given C’s empirical probability distribution (p = [0.2, 0.8]).

• H(A): Information which came with the knowledge of A.

• H(AB): Joint entropy.

• I(A;C) = H(A) + H(C) - H(AC): Mutual information or information gain. How much do A and C have in common?

• H(C|A) = H(C) - I(A;C): Conditional entropy. Remaining uncertainty in C after learning A.
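As a worked example, the entropy of the label distribution p = [0.2, 0.8] can be evaluated directly. The slides do not state the logarithm base; base 2 (bits) is assumed here.

```latex
H(C) = -\sum_{c} p(c)\,\log_2 p(c)
     = -0.2\,\log_2 0.2 - 0.8\,\log_2 0.8
     \approx 0.2 \times 2.322 + 0.8 \times 0.322
     \approx 0.722 \text{ bits}
```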

Interaction Information

I(A;B;C) := I(AB;C) - I(B;C) - I(A;C)

= I(B;C|A) - I(B;C)

= I(A;C|B) - I(A;C)

I(AB;C): How informative are A and B together?

(Partial) history of independent reinventions:

• Quastler ‘53 (Info. Theory in Biology) - measure of specificity

• McGill ‘54 (Psychometrika) - interaction information

• Han ‘80 (Information & Control) - multiple mutual information

• Yeung ‘91 (IEEE Trans. On Inf. Theory) - mutual information

• Grabisch & Roubens ‘99 (I. J. of Game Theory) - Banzhaf interaction index

• Matsuda ‘00 (Physical Review E) - higher-order mutual inf.

• Brenner et al. ‘00 (Neural Computation) - average synergy

• Demšar ’02 (A thesis in machine learning) - relative information gain

• Bell ‘03 (NIPS02, ICA2003) - co-information

• Jakulin ‘02 - interaction gain
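The definition can be tried on two extreme cases. This is a minimal sketch, assuming equally weighted rows of hypothetical toy data with columns A, B, C and base-2 logarithms; the function names are illustrative.

```python
from collections import Counter
from math import log2

def H(rows, cols):
    """Joint entropy (bits) of the chosen columns over equally weighted rows."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())

def mi(rows, x, y):
    """I(X;Y) = H(X) + H(Y) - H(XY); x and y are tuples of column indices."""
    return H(rows, x) + H(rows, y) - H(rows, x + y)

def interaction_information(rows):
    """I(A;B;C) = I(AB;C) - I(B;C) - I(A;C) with columns A=0, B=1, C=2."""
    return mi(rows, (0, 1), (2,)) - mi(rows, (1,), (2,)) - mi(rows, (0,), (2,))

# Hypothetical toy data: C = A xor B (pure synergy) and A = B = C (pure redundancy).
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
copy_rows = [(c, c, c) for c in (0, 1)]

synergy = interaction_information(xor_rows)      # positive interaction
redundancy = interaction_information(copy_rows)  # negative interaction
print(synergy, redundancy)
```

In the XOR case neither attribute helps alone, so the whole bit of label entropy is attributed to the interaction. In the copy case each attribute already explains the label, so the interaction information is negative.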

Applications: Interaction Graphs

• Information gain, 100% × I(A;C)/H(C): The attribute “explains” 1.98% of label entropy.

• A positive interaction, 100% × I(A;B;C)/H(C): The two attributes are in a synergy: treating them holistically may result in 1.85% extra uncertainty explained.

• A negative interaction, 100% × I(A;B;C)/H(C): The two attributes are slightly redundant: 1.15% of label uncertainty is explained by each of the two attributes.

CMC domain: the label is the ‘contraceptive method’ used by a couple.

Interaction as Attribute Proximity

[Figure legend: cluster “tightness” ranges from loose (weakly interacting) to tight (strongly interacting); information gain ranges from uninformative attribute to informative attribute.]

Part-to-Whole Approximation

Mutual information:

– Whole: P(A,B). Parts: {P(A), P(B)}

– Approximation: $\hat{P}(A,B) = P(A)\,P(B)$

– Kullback-Leibler divergence as the measure of difference:

$D(P \,\|\, \hat{P}) = \sum_{a,b} P(a,b) \log \frac{P(a,b)}{\hat{P}(a,b)} = I(A;B) = H(A) + H(B) - H(A,B)$

– Also applies for predictive accuracy:

$D(P(Y|A) \,\|\, P(Y)) = I(A;Y)$
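The identity between the KL-divergence of the part-to-whole approximation and mutual information can be verified numerically. This is a small sketch, assuming a hypothetical joint distribution over two binary attributes and base-2 logarithms.

```python
from math import log2

# Hypothetical joint distribution P(a, b) over two binary attributes.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal(joint, axis):
    """Sum the joint over every variable except the one at position `axis`."""
    out = {}
    for cell, p in joint.items():
        out[cell[axis]] = out.get(cell[axis], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

PA, PB = marginal(P, 0), marginal(P, 1)

# The parts-only approximation: P_hat(a, b) = P(a) P(b).
P_hat = {(a, b): PA[a] * PB[b] for a in PA for b in PB}

kl = sum(p * log2(p / P_hat[cell]) for cell, p in P.items() if p > 0)
mi = entropy(PA) + entropy(PB) - entropy(P)
print(kl, mi)
```

Both quantities agree up to rounding, which is the sense in which an information-theoretic expression is a model comparison.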

Kirkwood Superposition Approximation

It is a closed-form part-to-whole approximation, a special case of Kikuchi and mean-field approximations. The approximation is not normalized, explaining the negative interaction information. It is not optimal (loglinear models beat it).
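The formula itself did not survive in this transcript. The sketch below assumes the standard form of the Kirkwood superposition approximation for three variables, P_K(a,b,c) = P(a,b) P(a,c) P(b,c) / (P(a) P(b) P(c)), and uses two hypothetical toy distributions to show that its total mass need not equal 1 and that D(P || P_K) can therefore be negative.

```python
from itertools import product
from math import log2

def marginal(joint, axes):
    """Marginal over the variables at positions `axes`, keyed by tuples."""
    out = {}
    for cell, p in joint.items():
        key = tuple(cell[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def kirkwood(joint):
    """Assumed form: P_K(a,b,c) = P(a,b) P(a,c) P(b,c) / (P(a) P(b) P(c))."""
    pa, pb, pc = (marginal(joint, (i,)) for i in range(3))
    pab = marginal(joint, (0, 1))
    pac = marginal(joint, (0, 2))
    pbc = marginal(joint, (1, 2))
    approx = {}
    for (a,), (b,), (c,) in product(pa, pb, pc):
        num = pab.get((a, b), 0.0) * pac.get((a, c), 0.0) * pbc.get((b, c), 0.0)
        approx[(a, b, c)] = num / (pa[(a,)] * pb[(b,)] * pc[(c,)])
    return approx

def kl(p, q):
    """Sum of p log2(p/q); a true divergence only when q is normalized."""
    return sum(v * log2(v / q[cell]) for cell, v in p.items() if v > 0)

# Hypothetical toy distributions: A = B = C (redundancy) and C = A xor B (synergy).
copy_joint = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
xor_joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

copy_pk = kirkwood(copy_joint)
total_mass = sum(copy_pk.values())      # exceeds 1: not a proper distribution
copy_loss = kl(copy_joint, copy_pk)     # negative, matching I(A;B;C) for this case
xor_loss = kl(xor_joint, kirkwood(xor_joint))
print(total_mass, copy_loss, xor_loss)
```

For the redundant case the approximation puts mass 1 on each of the two observed cells, so the "divergence" comes out at -1 bit; for XOR the approximation is uniform and properly normalized, and the loss is +1 bit.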

Significance Testing

• Tries to answer the question: “When is P much better than P’?”

• It is based on the realization that even the correct probabilistic model P can expect to make an error for a sample of finite size.

• The notion of self-loss captures the distribution of loss of the complex model (“variance”).

• The notion of approximation loss captures the loss caused by using a simpler model (“bias”).

• P is significantly better than P’ when the error made by P’ is greater than the self-loss in 99.5% of cases. The P-value can be at most 0.05.

Test-Bootstrap Protocol

To obtain the self-loss distribution, we perturb the test data, which is a bootstrap sample from the whole data set. As the loss function, we employ KL-divergence:

This is VERY similar to assuming that D(P’||P) has a χ² distribution.
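The exact loss formula did not survive in this transcript, so the following is only a sketch of one plausible reading of the protocol, not the authors' implementation. It assumes that P is a joint distribution over two attributes, that P’ is the independence model built from P's marginals, that self-loss is D(P*||P) for a bootstrap resample P* of size n, and that the P-value is the fraction of resamples whose self-loss reaches the approximation loss D(P||P’). All function names and the example distributions are hypothetical.

```python
import random
from collections import Counter
from math import log2

def kl(p, q):
    """D(p||q) in bits; cells with p = 0 contribute nothing."""
    return sum(v * log2(v / q[cell]) for cell, v in p.items() if v > 0)

def independence_model(joint):
    """Simpler model P'(a,b) = P(a) P(b), built from the joint's marginals."""
    pa, pb = Counter(), Counter()
    for (a, b), v in joint.items():
        pa[a] += v
        pb[b] += v
    return {(a, b): pa[a] * pb[b] for a in pa for b in pb}

def bootstrap_p_value(joint, n, resamples=2000, seed=0):
    """Fraction of resamples whose self-loss reaches the approximation loss."""
    rng = random.Random(seed)
    cells, weights = zip(*joint.items())
    approx_loss = kl(joint, independence_model(joint))  # "bias" of P'
    hits = 0
    for _ in range(resamples):
        counts = Counter(rng.choices(cells, weights=weights, k=n))
        resampled = {cell: counts[cell] / n for cell in cells}
        if kl(resampled, joint) >= approx_loss:         # self-loss, "variance" of P
            hits += 1
    return hits / resamples

# Hypothetical examples: strongly dependent versus exactly independent attributes.
dependent = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
p_dep = bootstrap_p_value(dependent, n=200)
p_indep = bootstrap_p_value(independent, n=200)
print(p_dep, p_indep)
```

A small P-value means the simpler model's loss is rarely matched by sampling noise alone, so the interaction is kept; for independent attributes the approximation loss is zero and the simpler model is never rejected.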

Self-Loss

Cross-Validation Protocol

• P-values ignore the variation in approximation loss and the generalization power of a classifier.

• CV-values are based on the following perturbation procedure:

The Myth of Average Performance

[Figure: a distribution, with a horizontal axis running from “← interaction (complex) wins” to “approximation (simple) wins →”.]

How much do the mode/median/mean of the above distribution tell you about which model to select?

Summary

• The existence of an interaction implies the need for a more complex model that joins the attributes.

• Feature relevance is an interaction of order 2.

• If there is no interaction, a complex model is unnecessary.

• Information theory provides an approximate “algebra” for investigating interactions.

• The difference between two models is a distribution, not a scalar.

• Occam’s P-Razor: Pick the simplest model among those that are not significantly worse than the best one.