Information Theory For Data Management

Divesh Srivastava

Suresh Venkatasubramanian

-- Abstruse Goose (177)

Motivation

Information Theory is relevant to all of humanity...

Background

Many problems in data management need precise reasoning about information content, transfer and loss:
– Structure extraction
– Privacy preservation
– Schema design
– Probabilistic data?

Information Theory

First developed by Shannon as a way of quantifying capacity of signal channels.

Entropy, relative entropy and mutual information capture intrinsic informational aspects of a signal

Today:
– Information theory provides a domain-independent way to reason about structure in data
– More information = interesting structure
– Less information linkage = decoupling of structures


Tutorial Thesis

Information theory provides a mathematical framework for the quantification of information content, linkage and loss.

This framework can be used in the design of data management strategies that rely on probing the structure of information in data.

Tutorial Goals

– Introduce information-theoretic concepts to a DB audience
– Give a ‘data-centric’ perspective on information theory
– Connect these to applications in data management
– Describe underlying computational primitives

Illuminate when and how information theory might be of use in new areas of data management.


Outline

Part 1
– Introduction to Information Theory
– Application: Data Anonymization
– Application: Database Design

Part 2
– Review of Information Theory Basics
– Application: Data Integration
– Computing Information Theoretic Primitives
– Open Problems


Histograms And Discrete Distributions

x1

x2

x1

x1

x4

x2

x3

x1

X

Column of data

X f(X)

x1 4

x2 2

x3 1

x4 1

Histogram

X p(X)

x1 0.5

x2 0.25

x3 0.125

x4 0.125

Probability distribution

Column of data → (aggregate counts) → Histogram → (normalize) → Probability distribution


Histograms And Discrete Distributions

x1

x2

x1

x1

x4

x2

x3

x1

X

Column of data

X f(X)

x1 4

x2 2

x3 1

x4 1

Histogram

X p(X)

x1 0.667

x2 0.2

x3 0.067

x4 0.067

Probability distribution

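X f(X)*w(X)

x1 4*5=20

x2 2*3=6

x3 1*2=2

x4 1*2=2

Reweighted histogram

Column of data → (aggregate counts) → Histogram → (reweight) → Reweighted histogram → (normalize) → Probability distribution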
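The aggregate, reweight and normalize steps can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_distribution` and the dictionary layout are assumptions, while the column and the weights w(x1) = 5, w(x2) = 3, w(x3) = w(x4) = 2 are the ones on the slides.

```python
from collections import Counter


def empirical_distribution(column, weights=None):
    """Aggregate counts, optionally reweight them, then normalize."""
    counts = Counter(column)  # aggregate counts: f(x)
    if weights is not None:
        # Reweight: replace f(x) by f(x) * w(x).
        counts = {x: f * weights[x] for x, f in counts.items()}
    total = sum(counts.values())
    return {x: f / total for x, f in counts.items()}  # normalize


column = ["x1", "x2", "x1", "x1", "x4", "x2", "x3", "x1"]
p_plain = empirical_distribution(column)
p_weighted = empirical_distribution(column, {"x1": 5, "x2": 3, "x3": 2, "x4": 2})

print(p_plain)
print({x: round(p, 3) for x, p in p_weighted.items()})
```

The unweighted call reproduces the 0.5 / 0.25 / 0.125 / 0.125 distribution, and the weighted call reproduces 0.667 / 0.2 / 0.067 / 0.067.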


From Columns To Random Variables

We can think of a column of data as “represented” by a random variable:
– X is a random variable
– p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
– Also known (in the unweighted case) as the empirical distribution induced by the column X

Notation:
– X (upper case) denotes a random variable (column)
– x (lower case) denotes a value taken by X (field in a tuple)
– p(x) is the probability p(X = x)


Joint Distributions

Discrete distribution: probability p(X,Y,Z)

p(Y) = ∑x p(X=x,Y) = ∑x ∑z p(X=x,Y,Z=z)

X Y Z p(X,Y,Z)

x1 y1 z1 0.125

x1 y2 z2 0.125

x1 y1 z2 0.125

x1 y2 z1 0.125

x2 y3 z3 0.125

x2 y3 z4 0.125

x3 y3 z5 0.125

x4 y3 z6 0.125

X p(X)

x1 0.5

x2 0.25

x3 0.125

x4 0.125

Y p(Y)

y1 0.25

y2 0.25

y3 0.5

X Y p(X,Y)

x1 y1 0.25

x1 y2 0.25

x2 y3 0.25

x3 y3 0.125

x4 y3 0.125
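Marginalization is just summing the joint table over the columns that are dropped. A minimal sketch, assuming the joint distribution is stored as a dictionary from (x, y, z) tuples to probabilities (the helper name `marginal` is illustrative):

```python
from collections import defaultdict

# Joint distribution p(X,Y,Z) from the table above: eight tuples, 0.125 each.
joint = {
    ("x1", "y1", "z1"): 0.125,
    ("x1", "y2", "z2"): 0.125,
    ("x1", "y1", "z2"): 0.125,
    ("x1", "y2", "z1"): 0.125,
    ("x2", "y3", "z3"): 0.125,
    ("x2", "y3", "z4"): 0.125,
    ("x3", "y3", "z5"): 0.125,
    ("x4", "y3", "z6"): 0.125,
}


def marginal(joint, keep):
    """Sum out every coordinate whose index is not listed in `keep`."""
    out = defaultdict(float)
    for values, p in joint.items():
        out[tuple(values[i] for i in keep)] += p
    return dict(out)


p_x = marginal(joint, [0])      # p(X)
p_y = marginal(joint, [1])      # p(Y)
p_xy = marginal(joint, [0, 1])  # p(X,Y)
print(p_x, p_y, p_xy, sep="\n")
```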


Entropy Of A Column

Let h(x) = log2 1/p(x); h(X) is the column of h(x) values.

H(X) = EX[h(x)] = ∑x p(x) log2 1/p(x)

Two views of entropy:
– It captures uncertainty in data: higher entropy, more unpredictability
– It captures information content: higher entropy, more information

X p(X) h(X)

x1 0.5 1

x2 0.25 2

x3 0.125 3

x4 0.125 3

H(X) = 1.75 < log |X| = 2
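As a quick check of the numbers in this table, the entropy can be computed directly from the definition. The sketch below assumes a distribution stored as a dictionary from values to probabilities; the name `entropy` is illustrative.

```python
import math


def entropy(p):
    """H(X) = sum_x p(x) * log2(1/p(x)); zero-probability values contribute nothing."""
    return sum(px * math.log2(1 / px) for px in p.values() if px > 0)


p_x = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
h_x = {x: math.log2(1 / px) for x, px in p_x.items()}  # the h(X) column
H_x = entropy(p_x)
H_max = math.log2(len(p_x))  # entropy of the uniform distribution on |X| values

print(h_x)
print(H_x, H_max)
```

The uniform distribution over the same four values attains the upper bound log2 |X| = 2, which is why the skewed column sits strictly below it.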


Examples

X uniform over [1, ..., 4]. H(X) = 2

Y is 1 with probability 0.5, and otherwise uniform over [2, 3, 4].
– H(Y) = 0.5 log 2 + 0.5 log 6 ~= 1.8 < 2
– Y is more sharply defined, and so has less uncertainty.

Z uniform over [1, ..., 8]. H(Z) = 3 > 2
– Z spans a larger range, and captures more information

[Figure: the distributions of X, Y and Z]

Comparing Distributions

How do we measure the difference between two distributions?

Kullback-Leibler divergence:
– dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi)

Inference mechanism

Prior belief Resulting belief


Comparing Distributions

Kullback-Leibler divergence:
– dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi)
– dKL(p, q) >= 0
– Captures extra information needed to capture p given q
– Is asymmetric! dKL(p, q) != dKL(q, p)
– Is not a metric (does not satisfy triangle inequality)

There are other measures:
– 2-distance, variational distance, f-divergences, …
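The non-negativity and asymmetry of the KL divergence are easy to see numerically. The following sketch uses two hypothetical distributions p and q that are not from the slides; the function name and the choice to return infinity when q gives zero mass to a value in the support of p are assumptions of this example.

```python
import math


def kl_divergence(p, q):
    """d_KL(p, q) = sum_i p_i * log2(p_i / q_i), in bits.

    Returns infinity if q assigns zero probability to a value that p supports.
    """
    total = 0.0
    for i, pi in p.items():
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        qi = q.get(i, 0.0)
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


# Hypothetical example distributions over the same two values.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(round(d_pq, 3), round(d_qp, 3))
```

The two directions give different values, and both are non-negative, with zero only when the two distributions coincide.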


Conditional Probability

Given a joint distribution on random variables X, Y, how much information about X can we glean from Y ?

Conditional probability: p(X|Y)
– p(X = x1 | Y = y1) = p(X = x1, Y = y1)/p(Y = y1)

X Y p(X,Y) p(X|Y) p(Y|X)

x1 y1 0.25 1.0 0.5

x1 y2 0.25 1.0 0.5

x2 y3 0.25 0.5 1.0

x3 y3 0.125 0.25 1.0

x4 y3 0.125 0.25 1.0

X p(X)

x1 0.5

x2 0.25

x3 0.125

x4 0.125

Y p(Y)

y1 0.25

y2 0.25

y3 0.5


Mutual Information

Mutual information captures the difference between the joint distribution on X and Y, and the marginal distributions on X and Y.

Let i(x;y) = log p(x,y)/(p(x)p(y))

I(X;Y) = Ex,y[i(x;y)] = ∑x ∑y p(x,y) log p(x,y)/(p(x)p(y))

X Y p(X,Y) h(X,Y) i(X;Y)

x1 y1 0.25 2.0 1.0

x1 y2 0.25 2.0 1.0

x2 y3 0.25 2.0 1.0

x3 y3 0.125 3.0 1.0

x4 y3 0.125 3.0 1.0

X p(X) h(X)

x1 0.5 1.0

x2 0.25 2.0

x3 0.125 3.0

x4 0.125 3.0

Y p(Y) h(Y)

y1 0.25 2.0

y2 0.25 2.0

y3 0.5 1.0


Mutual Information: Strength of linkage

I(X;Y) = H(X) + H(Y) – H(X,Y) = H(X) – H(X|Y) = H(Y) – H(Y|X)

If X, Y are independent, then I(X;Y) = 0:
– H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) – H(X,Y) = 0

I(X;Y) <= min (H(X), H(Y))
– Suppose Y = f(X) (deterministically)
– Then H(Y|X) = 0, and so I(X;Y) = H(Y) – H(Y|X) = H(Y)

Mutual information captures higher-order interactions:
– Covariance captures “linear” interactions only
– Two variables can be uncorrelated (covariance = 0) and have nonzero mutual information:
– X drawn uniformly at random from [-1,1], Y = X^2. Cov(X,Y) = 0, I(X;Y) = H(Y) > 0
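The definition of I(X;Y) can be checked on the running-example joint distribution p(X,Y). This sketch assumes the joint is a dictionary from (x, y) pairs to probabilities; the function name `mutual_information` is illustrative, and the independent pair of fair coins is a hypothetical example added for contrast.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Running example: p(X,Y) from the joint distribution table.
joint_xy = {
    ("x1", "y1"): 0.25,
    ("x1", "y2"): 0.25,
    ("x2", "y3"): 0.25,
    ("x3", "y3"): 0.125,
    ("x4", "y3"): 0.125,
}
# Hypothetical independent pair: two fair coins.
joint_indep = {(a, b): 0.25 for a in "HT" for b in "HT"}

I_xy = mutual_information(joint_xy)
I_indep = mutual_information(joint_indep)
print(I_xy, I_indep)
```

For the running example this agrees with H(X) + H(Y) – H(X,Y) = 1.75 + 1.5 – 2.25 = 1.0, and the independent pair gives 0.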


Information Theory: Summary

We can represent data as discrete distributions (normalized histograms)

Entropy captures uncertainty or information content in a distribution

The Kullback-Leibler divergence captures the difference between distributions

Mutual information and conditional entropy capture linkage between variables in a joint distribution






Data Anonymization Using Randomization

Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals’ sensitive attributes

Key ideas:
– Randomize numerical data: add noise from known distribution
– Reconstruct original data distribution using published noisy data

Issues:
– How can the original data distribution be reconstructed?
– What kinds of randomization preserve privacy of individuals?


Data Anonymization Using Randomization

Many randomization strategies proposed [AS00, AA01, EGS03]

Example randomization strategies: X in [0, 10]
– R = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– R = X + μ (mod 11), μ is in {-1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)}
– R = X (p = 0.6), R = μ, μ is uniform in [0, 10] (p = 0.4)

Question:
– Which randomization strategy has higher privacy preservation?
– Quantify loss of privacy due to publication of randomized data


Data Anonymization Using Randomization

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}

Id X

s1 0

s2 3

s3 5

s4 0

s5 8

s6 0

s7 6

s8 0



Id X μ

s1 0 -1

s2 3 0

s3 5 1

s4 0 0

s5 8 1

s6 0 -1

s7 6 1

s8 0 0

→

Id R1

s1 10

s2 3

s3 6

s4 0

s5 9

s6 10

s7 7

s8 0



Id X μ

s1 0 0

s2 3 -1

s3 5 0

s4 0 1

s5 8 1

s6 0 -1

s7 6 -1

s8 0 1

→

Id R1

s1 0

s2 2

s3 5

s4 1

s5 9

s6 10

s7 5

s8 1
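The randomization step itself is a one-liner. The sketch below replays the noise values μ from the table above rather than sampling them, so the output is deterministic; the function name `randomize` is illustrative.

```python
def randomize(x, mu, modulus=11):
    """R = X + mu (mod 11). Python's % already maps -1 to 10."""
    return (x + mu) % modulus


# X and mu columns copied from the table above (s1 .. s8).
xs = [0, 3, 5, 0, 8, 0, 6, 0]
mus = [0, -1, 0, 1, 1, -1, -1, 1]

r1 = [randomize(x, mu) for x, mu in zip(xs, mus)]
print(r1)
```

In practice μ would be drawn at random, for example with `random.choice((-1, 0, 1))`, so repeated runs publish different R1 columns for the same X.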


Reconstruction of Original Data Distribution

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– Reconstruct distribution of X using knowledge of R1 and μ
– EM algorithm converges to MLE of original distribution [AA01]

Id X μ

s1 0 0

s2 3 -1

s3 5 0

s4 0 1

s5 8 1

s6 0 -1

s7 6 -1

s8 0 1

→

Id R1

s1 0

s2 2

s3 5

s4 1

s5 9

s6 10

s7 5

s8 1

→

Id X | R1

s1 {10, 0, 1}

s2 {1, 2, 3}

s3 {4, 5, 6}

s4 {0, 1, 2}

s5 {8, 9, 10}

s6 {9, 10, 0}

s7 {4, 5, 6}

s8 {0, 1, 2}


Analysis of Privacy [AS00]

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– If X is uniform in [0, 10], privacy determined by range of μ

Id X μ

s1 0 0

s2 3 -1

s3 5 0

s4 0 1

s5 8 1

s6 0 -1

s7 6 -1

s8 0 1

→

Id R1

s1 0

s2 2

s3 5

s4 1

s5 9

s6 10

s7 5

s8 1

→

Id X | R1

s1 {10, 0, 1}

s2 {1, 2, 3}

s3 {4, 5, 6}

s4 {0, 1, 2}

s5 {8, 9, 10}

s6 {9, 10, 0}

s7 {4, 5, 6}

s8 {0, 1, 2}


Analysis of Privacy [AA01]

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– If X is uniform in [0, 1] ∪ [5, 6], privacy smaller than range of μ

Id X μ

s1 0 0

s2 1 -1

s3 5 0

s4 6 1

s5 0 1

s6 1 -1

s7 5 -1

s8 6 1

→

Id R1

s1 0

s2 0

s3 5

s4 7

s5 1

s6 0

s7 4

s8 7

→

Id X | R1

s1 {10, 0, 1}

s2 {10, 0, 1}

s3 {4, 5, 6}

s4 {6, 7, 8}

s5 {0, 1, 2}

s6 {10, 0, 1}

s7 {3, 4, 5}

s8 {6, 7, 8}


Analysis of Privacy [AA01]

X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– If X is uniform in [0, 1] ∪ [5, 6], privacy smaller than range of μ
– In some cases, sensitive value revealed

Id X μ

s1 0 0

s2 1 -1

s3 5 0

s4 6 1

s5 0 1

s6 1 -1

s7 5 -1

s8 6 1

→

Id R1

s1 0

s2 0

s3 5

s4 7

s5 1

s6 0

s7 4

s8 7

→

Id X | R1

s1 {0, 1}

s2 {0, 1}

s3 {5, 6}

s4 {6}

s5 {0, 1}

s6 {0, 1}

s7 {5}

s8 {6}
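The candidate sets X | R1 in these tables come from inverting the noise and then intersecting with whatever is known about the support of X. A small sketch, where the function name `candidates` and its parameters are illustrative assumptions:

```python
def candidates(r, noise=(-1, 0, 1), support=None, modulus=11):
    """Values of X that could have produced the published value r."""
    values = {(r - mu) % modulus for mu in noise}
    if support is not None:
        # Prior knowledge: X only takes values in `support`.
        values &= set(support)
    return sorted(values)


support = {0, 1, 5, 6}  # X is uniform in [0, 1] ∪ [5, 6]

print(candidates(0))                   # no prior knowledge about X
print(candidates(5, support=support))  # two values remain
print(candidates(7, support=support))  # a single value remains: X is revealed
```

With no prior knowledge every published value leaves three candidates; with the restricted support, R1 = 7 or R1 = 4 leaves exactly one.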


Quantify Loss of Privacy [AA01]

Goal: quantify loss of privacy based on mutual information I(X;R)
– Smaller H(X|R) ⇒ more loss of privacy in X by knowledge of R
– Larger I(X;R) ⇒ more loss of privacy in X by knowledge of R
– I(X;R) = H(X) – H(X|R)

I(X;R) used to capture correlation between X and R
– p(X) is the prior knowledge of sensitive attribute X
– p(X, R) is the joint distribution of X and R




Goal: quantify loss of privacy based on mutual information I(X;R)
– X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– I(X;R1) = 0.33

X R1 p(X,R1) h(X,R1) i(X;R1)

5 4 0.17 2.58 1.0

5 5 0.17 2.58 0.0

5 6 0.17 2.58 0.0

6 5 0.17 2.58 0.0

6 6 0.17 2.58 0.0

6 7 0.17 2.58 1.0

X p(X) h(X)

5 0.5 1.0

6 0.5 1.0

R1 p(R1) h(R1)

4 0.17 2.58

5 0.34 1.58

6 0.34 1.58

7 0.17 2.58



Goal: quantify loss of privacy based on mutual information I(X;R)
– X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1}
– I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1

X R2 p(X,R2) h(X,R2) i(X;R2)

5 5 0.25 2.0 1.0

5 6 0.25 2.0 0.0

6 6 0.25 2.0 0.0

6 7 0.25 2.0 1.0

X p(X) h(X)

5 0.5 1.0

6 0.5 1.0

R2 p(R2) h(R2)

5 0.25 2.0

6 0.5 1.0

7 0.25 2.0



Equivalent goal: quantify loss of privacy based on H(X|R)
– X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1}
– Intuition: we know more about X given R2, than about X given R1
– H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1

X R2 p(X,R2) p(X|R2) h(X|R2)

5 5 0.25 1.0 0.0

5 6 0.25 0.5 1.0

6 6 0.25 0.5 1.0

6 7 0.25 1.0 0.0

X R1 p(X,R1) p(X|R1) h(X|R1)

5 4 0.17 1.0 0.0

5 5 0.17 0.5 1.0

5 6 0.17 0.5 1.0

6 5 0.17 0.5 1.0

6 6 0.17 0.5 1.0

6 7 0.17 1.0 0.0
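The comparison of R1 and R2 can be reproduced by building the joint distribution p(X,R) from the randomization rule and then applying the definitions. This is a sketch under the slide's assumptions (X uniform on {5, 6}, μ uniform on the given noise set); the helper names are illustrative.

```python
import math
from collections import defaultdict


def joint_from_randomization(x_values, noise, modulus=11):
    """p(x, r) for X uniform over x_values and R = X + mu (mod modulus), mu uniform over noise."""
    joint = defaultdict(float)
    p = 1.0 / (len(x_values) * len(noise))
    for x in x_values:
        for mu in noise:
            joint[(x, (x + mu) % modulus)] += p
    return dict(joint)


def mutual_information(joint):
    """I(X;R) = sum p(x,r) * log2( p(x,r) / (p(x) * p(r)) )."""
    p_x, p_r = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        p_x[x] += p
        p_r[r] += p
    return sum(p * math.log2(p / (p_x[x] * p_r[r])) for (x, r), p in joint.items())


def conditional_entropy(joint):
    """H(X|R) = sum p(x,r) * log2( 1 / p(x|r) ), with p(x|r) = p(x,r) / p(r)."""
    p_r = defaultdict(float)
    for (_, r), p in joint.items():
        p_r[r] += p
    return sum(p * math.log2(p_r[r] / p) for (_, r), p in joint.items())


joint_r1 = joint_from_randomization([5, 6], [-1, 0, 1])
joint_r2 = joint_from_randomization([5, 6], [0, 1])

for name, joint in (("R1", joint_r1), ("R2", joint_r2)):
    print(name, round(mutual_information(joint), 2), round(conditional_entropy(joint), 2))
```

Since H(X) = 1 here, the two measures are complementary: I(X;R) = 1 – H(X|R) for both strategies.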


Quantify Loss of Privacy

Example: X is uniform in [0, 1]
– R3 = e (p = 0.9999), R3 = X (p = 0.0001)
– R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)

Is R3 or R4 a bigger privacy risk?


Worst Case Loss of Privacy [EGS03]



I(X;R3) = 0.0001 << I(X;R4) = 0.029
– But R3 has a larger worst case risk

X R3 p(X,R3) h(X,R3) i(X;R3)

0 e 0.49995 1.0 0.0

0 0 0.00005 14.29 1.0

1 e 0.49995 1.0 0.0

1 1 0.00005 14.29 1.0

X R4 p(X,R4) h(X,R4) i(X;R4)

0 0 0.3 1.74 0.26

0 1 0.2 2.32 -0.32

1 0 0.2 2.32 -0.32

1 1 0.3 1.74 0.26



Goal: quantify worst case loss of privacy in X by knowledge of R
– Use max KL divergence, instead of mutual information

Mutual information can be formulated as expected KL divergence
– I(X;R) = ∑x ∑r p(x,r)*log2(p(x,r)/(p(x)*p(r))) = KL(p(X,R) || p(X)*p(R))
– I(X;R) = ∑r p(r) ∑x p(x|r)*log2(p(x|r)/p(x)) = ER [KL(p(X|r) || p(X))]
– [AA01] measure quantifies expected loss of privacy over R

[EGS03] propose a measure based on worst case loss of privacy
– IW(X;R) = MAXr [KL(p(X|r) || p(X))]



Example: X is uniform in [0, 1]
– R3 = e (p = 0.9999), R3 = X (p = 0.0001)
– R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)

IW(X;R3) = max{0.0, 1.0, 1.0} > IW(X;R4) = max{0.029, 0.029}

X R3 p(X,R3) p(X|R3) i(X;R3)

0 e 0.49995 0.5 0.0

0 0 0.00005 1.0 1.0

1 e 0.49995 0.5 0.0

1 1 0.00005 1.0 1.0

X R4 p(X,R4) p(X|R4) i(X;R4)

0 0 0.3 0.6 0.26

0 1 0.2 0.4 -0.32

1 0 0.2 0.4 -0.32

1 1 0.3 0.6 0.26



Example: X is uniform in [5, 6]
– R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– R2 = X + μ (mod 11), μ is uniform in {0, 1}

IW(X;R1) = max{1.0, 0.0, 0.0, 1.0} = IW(X;R2) = max{1.0, 0.0, 1.0}
– Unable to capture that R2 is a bigger privacy risk than R1

X R1 p(X,R1) p(X|R1) i(X;R1)

5 4 0.17 1.0 1.0

5 5 0.17 0.5 0.0

5 6 0.17 0.5 0.0

6 5 0.17 0.5 0.0

6 6 0.17 0.5 0.0

6 7 0.17 1.0 1.0

X R2 p(X,R2) p(X|R2) i(X;R2)

5 5 0.25 1.0 1.0

5 6 0.25 0.5 0.0

6 6 0.25 0.5 0.0

6 7 0.25 1.0 1.0
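Both the expected and the worst case measure fall out of the per-value divergences KL(p(X|r) || p(X)). The sketch below computes them for the R3 and R4 joint tables; the function names are illustrative, and the symbol "e" is kept as an opaque published value, exactly as it appears in the tables.

```python
import math
from collections import defaultdict


def kl_per_published_value(joint):
    """Return ({r: KL(p(X|r) || p(X))}, {r: p(r)}) for a joint {(x, r): p}."""
    p_x, p_r = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        p_x[x] += p
        p_r[r] += p
    kl = defaultdict(float)
    for (x, r), p in joint.items():
        if p > 0:
            p_x_given_r = p / p_r[r]
            kl[r] += p_x_given_r * math.log2(p_x_given_r / p_x[x])
    return dict(kl), dict(p_r)


def expected_loss(joint):
    """[AA01]-style measure: I(X;R) = E_R[ KL(p(X|r) || p(X)) ]."""
    kl, p_r = kl_per_published_value(joint)
    return sum(p_r[r] * kl[r] for r in kl)


def worst_case_loss(joint):
    """[EGS03]-style measure: I_W(X;R) = max_r KL(p(X|r) || p(X))."""
    kl, _ = kl_per_published_value(joint)
    return max(kl.values())


joint_r3 = {(0, "e"): 0.49995, (0, 0): 0.00005, (1, "e"): 0.49995, (1, 1): 0.00005}
joint_r4 = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}

for name, joint in (("R3", joint_r3), ("R4", joint_r4)):
    print(name, round(expected_loss(joint), 4), round(worst_case_loss(joint), 4))
```

R3 has the smaller expected loss but the larger worst case loss, because the rare published values 0 and 1 reveal X exactly.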


Data Anonymization: Summary

Randomization techniques useful for microdata anonymization
– Randomization techniques differ in their loss of privacy

Information theoretic measures useful to capture loss of privacy
– Expected KL divergence captures expected privacy loss [AA01]
– Maximum KL divergence captures worst case privacy loss [EGS03]
– Both are useful in practice






Information Dependencies [DR00]

Goal: use information theory to examine and reason about information content of the attributes in a relation instance

Key ideas:
– Novel InD measure between attribute sets X, Y based on H(Y|X)
– Identify numeric inequalities between InD measures

Results:
– InD measures are a broader class than FDs and MVDs
– Armstrong axioms for FDs derivable from InD inequalities
– MVD inference rules derivable from InD inequalities



Functional dependency: X → Y
– FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))

X Y Z

x1 y1 z1

x1 y2 z2

x1 y1 z2

x1 y2 z1

x2 y3 z3

x2 y3 z4

x3 y3 z5

x4 y3 z6



Result: FD X → Y holds iff H(Y|X) = 0
– Intuition: once X is known, no remaining uncertainty in Y

H(Y|X) = 0.5

X Y p(X,Y) p(Y|X) h(Y|X)

x1 y1 0.25 0.5 1.0

x1 y2 0.25 0.5 1.0

x2 y3 0.25 1.0 0.0

x3 y3 0.125 1.0 0.0

x4 y3 0.125 1.0 0.0

X p(X)

x1 0.5

x2 0.25

x3 0.125

x4 0.125

Y p(Y)

y1 0.25

y2 0.25

y3 0.5



Multi-valued dependency: X →→ Y
– MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)

X Y Z

x1 y1 z1

x1 y2 z2

x1 y1 z2

x1 y2 z1

x2 y3 z3

x2 y3 z4

x3 y3 z5

x4 y3 z6




X Y Z

x1 y1 z1

x1 y2 z2

x1 y1 z2

x1 y2 z1

x2 y3 z3

x2 y3 z4

x3 y3 z5

x4 y3 z6

X Y

x1 y1

x1 y2

x2 y3

x3 y3

x4 y3

X Z

x1 z1

x1 z2

x2 z3

x2 z4

x3 z5

x4 z6

=



Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
– Intuition: once X known, uncertainties in Y and Z are independent

H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25

X Y h(Y|X)

x1 y1 1.0

x1 y2 1.0

x2 y3 0.0

x3 y3 0.0

x4 y3 0.0

X Z h(Z|X)

x1 z1 1.0

x1 z2 1.0

x2 z3 1.0

x2 z4 1.0

x3 z5 0.0

x4 z6 0.0

X Y Z h(Y,Z|X)

x1 y1 z1 2.0

x1 y2 z2 2.0

x1 y1 z2 2.0

x1 y2 z1 2.0

x2 y3 z3 1.0

x2 y3 z4 1.0

x3 y3 z5 0.0

x4 y3 z6 0.0
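Both dependency tests reduce to conditional entropies of the empirical distribution of the relation instance. A minimal sketch on the eight-tuple instance above, using H(A|B) = H(A,B) – H(B); the helper names and the tolerance `EPS` are assumptions of this example.

```python
import math
from collections import Counter

# The relation instance R(X, Y, Z) from the slides; column indices 0, 1, 2.
rows = [
    ("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"), ("x1", "y2", "z1"),
    ("x2", "y3", "z3"), ("x2", "y3", "z4"), ("x3", "y3", "z5"), ("x4", "y3", "z6"),
]
X, Y, Z = 0, 1, 2
EPS = 1e-9  # tolerance for floating-point comparisons


def entropy_of(rows, cols):
    """Entropy of the empirical distribution of the projection onto `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return sum((k / n) * math.log2(n / k) for k in counts.values())


def cond_entropy(rows, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    both = sorted(set(target) | set(given))
    return entropy_of(rows, both) - entropy_of(rows, given)


h_y_x = cond_entropy(rows, [Y], [X])
h_z_x = cond_entropy(rows, [Z], [X])
h_yz_x = cond_entropy(rows, [Y, Z], [X])

fd_holds = abs(h_y_x) < EPS                       # FD X → Y iff H(Y|X) = 0
mvd_holds = abs(h_yz_x - (h_y_x + h_z_x)) < EPS   # MVD X →→ Y iff H(Y,Z|X) = H(Y|X) + H(Z|X)
print(h_y_x, h_z_x, h_yz_x, fd_holds, mvd_holds)
```

On this instance the FD X → Y fails (H(Y|X) = 0.5) while the MVD X →→ Y holds (0.5 + 0.75 = 1.25).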


Database Normal Forms

Goal: eliminate update anomalies by good database design
– Need to know the integrity constraints on all database instances

Boyce-Codd normal form:
– Input: a set ∑ of functional dependencies
– For every (non-trivial) FD R.X → R.Y ∈ ∑+, R.X is a key of R

4NF:
– Input: a set ∑ of functional and multi-valued dependencies
– For every (non-trivial) MVD R.X →→ R.Y ∈ ∑+, R.X is a key of R



Functional dependency: X → Y
– Which design is better?

Decomposition is in BCNF

X Y Z

x1 y1 z1

x1 y1 z2

x2 y2 z3

x2 y2 z4

x3 y3 z5

x4 y4 z6

X Y

x1 y1

x2 y2

x3 y3

x4 y4

X Z

x1 z1

x1 z2

x2 z3

x2 z4

x3 z5

x4 z6

=



Multi-valued dependency: X →→ Y
– Which design is better?

Decomposition is in 4NF

X Y Z

x1 y1 z1

x1 y2 z2

x1 y1 z2

x1 y2 z1

x2 y3 z3

x2 y3 z4

x3 y3 z5

x4 y3 z6

X Y

x1 y1

x1 y2

x2 y3

x3 y3

x4 y3

X Z

x1 z1

x1 z2

x2 z3

x2 z4

x3 z5

x4 z6

=


Well-Designed Databases [AL03]

Goal: use information theory to characterize “goodness” of a database design and reason about normalization algorithms

Key idea:
– Information content measure of cell in a DB instance w.r.t. ICs
– Redundancy reduces information content measure of cells

Results:
– Well-designed DB ⇔ each cell has information content > 0
– Normalization algorithms never decrease information content



Information content of cell c in database D satisfying FD X → Y
– Uniform distribution p(V) on values for c consistent with D\c and FD
– Information content of cell c is entropy H(V)

H(V62) = 2.0

X Y Z

x1 y1 z1

x1 y1 z2

x2 y2 z3

x2 y2 z4

x3 y3 z5

x4 y4 z6

V62 p(V62) h(V62)

y1 0.25 2.0

y2 0.25 2.0

y3 0.25 2.0

y4 0.25 2.0



Information content of cell c in database D satisfying FD X → Y
– Uniform distribution p(V) on values for c consistent with D\c and FD
– Information content of cell c is entropy H(V)

H(V22) = 0.0

X Y Z

x1 y1 z1

x1 y1 z2

x2 y2 z3

x2 y2 z4

x3 y3 z5

x4 y4 z6

V22 p(V22) h(V22)

y1 1.0 0.0

y2 0.0

y3 0.0

y4 0.0



Information content of cell c in database D satisfying FD X → Y
– Information content of cell c is entropy H(V)

Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
– Technicalities w.r.t. size of active domain

X Y Z

x1 y1 z1

x1 y1 z2

x2 y2 z3

x2 y2 z4

x3 y3 z5

x4 y4 z6

c H(V)

c12 0.0

c22 0.0

c32 0.0

c42 0.0

c52 2.0

c62 2.0




H(V12) = 2.0, H(V42) = 2.0

V42 p(V42) h(V42)

y1 0.25 2.0

y2 0.25 2.0

y3 0.25 2.0

y4 0.25 2.0

X Y

x1 y1

x2 y2

x3 y3

x4 y4

X Z

x1 z1

x1 z2

x2 z3

x2 z4

x3 z5

x4 z6

V12 p(V12) h(V12)

y1 0.25 2.0

y2 0.25 2.0

y3 0.25 2.0

y4 0.25 2.0




Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0, for all cells c in D

X Y

x1 y1

x2 y2

x3 y3

x4 y4

X Z

x1 z1

x1 z2

x2 z3

x2 z4

x3 z5

x4 z6

c H(V)

c12 2.0

c22 2.0

c32 2.0

c42 2.0



Information content of cell c in DB D satisfying MVD X →→ Y
– Information content of cell c is entropy H(V)

H(V52) = 0.0, H(V53) = 2.32

X Y Z

x1 y1 z1

x1 y2 z2

x1 y1 z2

x1 y2 z1

x2 y3 z3

x2 y3 z4

x3 y3 z5

x4 y3 z6

V52 p(V52) h(V52)

y3 1.0 0.0

V53 p(V53) h(V53)

z1 0.2 2.32

z2 0.2 2.32

z3 0.2 2.32

z4 0.0

z5 0.2 2.32

z6 0.2 2.32




Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D

X Y Z

x1 y1 z1

x1 y2 z2

x1 y1 z2

x1 y2 z1

x2 y3 z3

x2 y3 z4

x3 y3 z5

x4 y3 z6

c H(V)

c12 0.0

c22 0.0

c32 0.0

c42 0.0

c52 0.0

c62 0.0

c72 1.58

c82 1.58

c H(V)

c13 0.0

c23 0.0

c33 0.0

c43 0.0

c53 2.32

c63 2.32

c73 2.58

c83 2.58




H(V32) = 1.58, H(V34) = 2.32

V34 p(V34) h(V34)

z1 0.2 2.32

z2 0.2 2.32

z3 0.2 2.32

z4 0.0

z5 0.2 2.32

z6 0.2 2.32

X Y

x1 y1

x1 y2

x2 y3

x3 y3

x4 y3

X Z

x1 z1

x1 z2

x2 z3

x2 z4

x3 z5

x4 z6

V32 p(V32) h(V32)

y1 0.33 1.58

y2 0.33 1.58

y3 0.33 1.58




Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D

X Y

x1 y1

x1 y2

x2 y3

x3 y3

x4 y3

X Z

x1 z1

x1 z2

x2 z3

x2 z4

x3 z5

x4 z6

c H(V)

c12 1.0

c22 1.0

c32 1.58

c42 1.58

c52 1.58

c H(V)

c14 2.32

c24 2.32

c34 2.32

c44 2.32

c54 2.58

c64 2.58



Normalization algorithms never decrease information content
– Information content of cell c is entropy H(V)

X Y Z

x1 y1 z1

x1 y2 z2

x1 y1 z2

x1 y2 z1

x2 y3 z3

x2 y3 z4

x3 y3 z5

x4 y3 z6

c H(V)

c13 0.0

c23 0.0

c33 0.0

c43 0.0

c53 2.32

c63 2.32

c73 2.58

c83 2.58




c H(V)

c14 2.32

c24 2.32

c34 2.32

c44 2.32

c54 2.58

c64 2.58

X Y Z

x1 y1 z1

x1 y2 z2

x1 y1 z2

x1 y2 z1

x2 y3 z3

x2 y3 z4

x3 y3 z5

x4 y3 z6

X Y

x1 y1

x1 y2

x2 y3

x3 y3

x4 y3

X Z

x1 z1

x1 z2

x2 z3

x2 z4

x3 z5

x4 z6

=

c H(V)

c13 0.0

c23 0.0

c33 0.0

c43 0.0

c53 2.32

c63 2.32

c73 2.58

c83 2.58


Database Design: Summary

Good database design essential for preserving data integrity

Information theoretic measures useful for integrity constraints
– FD X → Y holds iff InD measure H(Y|X) = 0
– MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
– Information theory to model correlations in specific database

Information theoretic measures useful for normal forms
– Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
– Information theory to model distributions over possible databases






Review of Information Theory Basics

Discrete distribution: probability p(X)

p(X,Y) = ∑z p(X,Y,Z=z)

X Y Z p(X,Y,Z)

x1 y1 z1 0.125

x1 y2 z2 0.125

x1 y1 z2 0.125

x1 y2 z1 0.125

x2 y3 z3 0.125

x2 y3 z4 0.125

x3 y3 z5 0.125

x4 y3 z6 0.125

X p(X)

x1 0.5

x2 0.25

x3 0.125

x4 0.125

Y p(Y)

y1 0.25

y2 0.25

y3 0.5

X Y p(X,Y)

x1 y1 0.25

x1 y2 0.25

x2 y3 0.25

x3 y3 0.125

x4 y3 0.125



Discrete distribution: probability p(X)

p(Y) = ∑x p(X=x,Y) = ∑x ∑z p(X=x,Y,Z=z)

X Y Z p(X,Y,Z)

x1 y1 z1 0.125

x1 y2 z2 0.125

x1 y1 z2 0.125

x1 y2 z1 0.125

x2 y3 z3 0.125

x2 y3 z4 0.125

x3 y3 z5 0.125

x4 y3 z6 0.125

X p(X)

x1 0.5

x2 0.25

x3 0.125

x4 0.125

Y p(Y)

y1 0.25

y2 0.25

y3 0.5

X Y p(X,Y)

x1 y1 0.25

x1 y2 0.25

x2 y3 0.25

x3 y3 0.125

x4 y3 0.125



Discrete distribution: conditional probability p(X|Y)

p(X,Y) = p(X|Y)*p(Y) = p(Y|X)*p(X)

X Y p(X,Y) p(X|Y) p(Y|X)

x1 y1 0.25 1.0 0.5

x1 y2 0.25 1.0 0.5

x2 y3 0.25 0.5 1.0

x3 y3 0.125 0.25 1.0

x4 y3 0.125 0.25 1.0

X p(X)

x1 0.5

x2 0.25

x3 0.125

x4 0.125

Y p(Y)

y1 0.25

y2 0.25

y3 0.5



Discrete distribution: entropy H(X)

h(x) = log2(1/p(x))
– H(X) = ∑x p(x)*h(x) = 1.75
– H(Y) = ∑y p(y)*h(y) = 1.5 (≤ log2(|Y|) = 1.58)
– H(X,Y) = ∑x ∑y p(x,y)*h(x,y) = 2.25 (≤ log2(|X,Y|) = 2.32)

X Y p(X,Y) h(X,Y)

x1 y1 0.25 2.0

x1 y2 0.25 2.0

x2 y3 0.25 2.0

x3 y3 0.125 3.0

x4 y3 0.125 3.0

X p(X) h(X)

x1 0.5 1.0

x2 0.25 2.0

x3 0.125 3.0

x4 0.125 3.0

Y p(Y) h(Y)

y1 0.25 2.0

y2 0.25 2.0

y3 0.5 1.0



Discrete distribution: conditional entropy H(X|Y)

h(x|y) = log2(1/p(x|y))
– H(X|Y) = ∑x ∑y p(x,y)*h(x|y) = 0.75
– H(X|Y) = H(X,Y) – H(Y) = 2.25 – 1.5

X Y p(X,Y) p(X|Y) h(X|Y)

x1 y1 0.25 1.0 0.0

x1 y2 0.25 1.0 0.0

x2 y3 0.25 0.5 1.0

x3 y3 0.125 0.25 2.0

x4 y3 0.125 0.25 2.0

X p(X) h(X)

x1 0.5 1.0

x2 0.25 2.0

x3 0.125 3.0

x4 0.125 3.0

Y p(Y) h(Y)

y1 0.25 2.0

y2 0.25 2.0

y3 0.5 1.0



Discrete distribution: mutual information I(X;Y)

i(x;y) = log2(p(x,y)/(p(x)*p(y)))
– I(X;Y) = ∑x ∑y p(x,y)*i(x;y) = 1.0
– I(X;Y) = H(X) + H(Y) – H(X,Y) = 1.75 + 1.5 – 2.25

X Y p(X,Y) h(X,Y) i(X;Y)

x1 y1 0.25 2.0 1.0

x1 y2 0.25 2.0 1.0

x2 y3 0.25 2.0 1.0

x3 y3 0.125 3.0 1.0

x4 y3 0.125 3.0 1.0

X p(X) h(X)

x1 0.5 1.0

x2 0.25 2.0

x3 0.125 3.0

x4 0.125 3.0

Y p(Y) h(Y)

y1 0.25 2.0

y2 0.25 2.0

y3 0.5 1.0






Schema Matching

Goal: align columns across database tables to be integrated
– Fundamental problem in database integration

Early useful approach: textual similarity of column names
– False positives: Address ≠ IP_Address
– False negatives: Customer_Id = Client_Number

Early useful approach: overlap of values in columns, e.g., Jaccard
– False positives: Emp_Id ≠ Project_Id
– False negatives: Emp_Id = Personnel_Number


Opaque Schema Matching [KN03]

Goal: align columns when column names, data values are opaque
– Databases belong to different government bureaucracies
– Treat column names and data values as uninterpreted (generic)

Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id, Status_Id)
– Likely that all Id fields are from the same domain
– Different databases may have different column names

W X Y Z

w2 x1 y1 z2

w4 x2 y3 z3

w3 x3 y3 z1

w1 x2 y1 z2

A B C D

a1 b2 c1 d1

a3 b4 c2 d2

a1 b1 c1 d2

a4 b3 c2 d3



Approach: build complete, labeled graph GD for each database D
– Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
– Perform graph matching between GD1 and GD2, minimizing distance

Intuition:
– Entropy H(X) captures distribution of values in database column X
– Mutual information I(X;Y) captures correlations between X, Y
– Efficiency: graph matching between schema-sized graphs



Approach: build complete, labeled graph GD for each database D
– Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

A B C D

a1 b2 c1 d1

a3 b4 c2 d2

a1 b1 c1 d2

a4 b3 c2 d3

A p(A)

a1 0.5

a3 0.25

a4 0.25

B p(B)

b1 0.25

b2 0.25

b3 0.25

b4 0.25

C p(C)

c1 0.5

c2 0.5

D p(D)

d1 0.25

d2 0.5

d3 0.25




H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5

A B C D

a1 b2 c1 d1

a3 b4 c2 d2

a1 b1 c1 d2

a4 b3 c2 d3

A h(A)

a1 1.0

a3 2.0

a4 2.0

B h(B)

b1 2.0

b2 2.0

b3 2.0

b4 2.0

C h(C)

c1 1.0

c2 1.0

D h(D)

d1 2.0

d2 1.0

d3 2.0




H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5

A B C D

a1 b2 c1 d1

a3 b4 c2 d2

a1 b1 c1 d2

a4 b3 c2 d3

A h(A)

a1 1.0

a3 2.0

a4 2.0

B h(B)

b1 2.0

b2 2.0

b3 2.0

b4 2.0

A B h(A,B) i(A;B)

a1 b2 2.0 1.0

a3 b4 2.0 2.0

a1 b1 2.0 1.0

a4 b3 2.0 2.0




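[Figure: complete labeled graph for the table (A, B, C, D). Node labels: H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5. Edge labels: I(A;B) = 1.5, I(A;C) = 1.0, I(A;D) = 1.0, I(B;C) = 1.0, I(B;D) = 1.5, I(C;D) = 0.5]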




[KN03] uses euclidean and normal distance metrics

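[Figure: the two labeled graphs side by side. Graph for (W, X, Y, Z): H(W) = 2.0, H(X) = 1.5, H(Y) = 1.0, H(Z) = 1.5; I(W;X) = 1.5, I(W;Y) = 1.0, I(W;Z) = 1.5, I(X;Y) = 0.5, I(X;Z) = 1.0, I(Y;Z) = 1.0. Graph for (A, B, C, D): H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5; I(A;B) = 1.5, I(A;C) = 1.0, I(A;D) = 1.0, I(B;C) = 1.0, I(B;D) = 1.5, I(C;D) = 0.5. All labels agree under the matching W ↔ B, X ↔ D, Y ↔ C, Z ↔ A]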
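For schema-sized graphs the matching can be sketched by brute force: label each graph with H and I, then try every column permutation and keep the one with the smallest Euclidean distance between labels. This is an illustrative sketch, not the [KN03] algorithm; the function names, the exhaustive search, and the assumption that both tables have the same number of columns are all choices made for this example.

```python
import math
from collections import Counter
from itertools import permutations


def entropy(values):
    """Entropy (bits) of the empirical distribution of a list of values."""
    n = len(values)
    return sum((k / n) * math.log2(n / k) for k in Counter(values).values())


def label_graph(table):
    """Node labels H(c) and edge labels I(a;b) = H(a) + H(b) - H(a,b)."""
    cols = list(table)
    H = {c: entropy(table[c]) for c in cols}
    I = {}
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            I[frozenset((a, b))] = H[a] + H[b] - entropy(list(zip(table[a], table[b])))
    return H, I


def best_match(t1, t2):
    """Try every mapping of t1's columns onto t2's; return (distance, mapping)."""
    H1, I1 = label_graph(t1)
    H2, I2 = label_graph(t2)
    cols1 = list(t1)
    best = None
    for perm in permutations(t2):
        m = dict(zip(cols1, perm))
        d2 = sum((H1[c] - H2[m[c]]) ** 2 for c in cols1)
        d2 += sum((I1[e] - I2[frozenset(m[c] for c in e)]) ** 2 for e in I1)
        if best is None or d2 < best[0]:
            best = (d2, m)
    return math.sqrt(best[0]), best[1]


t1 = {"W": ["w2", "w4", "w3", "w1"], "X": ["x1", "x2", "x3", "x2"],
      "Y": ["y1", "y3", "y3", "y1"], "Z": ["z2", "z3", "z1", "z2"]}
t2 = {"A": ["a1", "a3", "a1", "a4"], "B": ["b2", "b4", "b1", "b3"],
      "C": ["c1", "c2", "c1", "c2"], "D": ["d1", "d2", "d2", "d3"]}

distance, mapping = best_match(t1, t2)
print(distance, mapping)
```

On the two example tables the search finds a zero-distance matching. Exhaustive search is factorial in the number of columns, so it only makes sense for small schemas.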


Heterogeneity Identification [DKOSV06]

Goal: identify columns with semantically heterogeneous values
– Can arise due to opaque schema matching [KN03]

Key ideas:
– Heterogeneity based on distribution, distinguishability of values
– Use Information Theory to quantify heterogeneity

Issues:
– Which information theoretic measure characterizes heterogeneity?
– How do we actually cluster the data?


Heterogeneity Identification [DKOSV06]

Example: semantically homogeneous, heterogeneous columns

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596




More semantic types in column ⇒ greater heterogeneity
– Only email versus email + phone

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596



Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

(877)-807-4596

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596



Relative distribution of semantic types impacts heterogeneity
– Mainly email + few phone versus balanced email + phone

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

(877)-807-4596

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596



Customer_Id

187-65-2468

987-64-6837

789-15-4321

987-65-4321

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596



More easily distinguished types ⇒ greater heterogeneity
– Phone + (possibly) SSN versus balanced email + phone

Customer_Id

187-65-2468

987-64-6837

789-15-4321

987-65-4321

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596

Customer_Id

[email protected]

[email protected]

[email protected]

[email protected]

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596


Heterogeneity Identification [DKOSV06]

Heterogeneity = complexity of describing the data
– More, balanced clusters ⇒ greater heterogeneity
– More distinguishable clusters ⇒ greater heterogeneity

Use two views of mutual information
1. It quantifies the complexity of describing the data (compression)
2. It quantifies the quality of the compression


Heterogeneity Identification [DKOSV06]

Hard clustering

X = Customer_Id T = Cluster_Id

187-65-2468 t1

987-64-6837 t1

789-15-4321 t1

987-65-4321 t1

(908)-555-1234 t2

973-360-0000 t1

360-0007 t3

(877)-807-4596 t2


Measuring complexity of clustering

Take 1: complexity of a clustering = #clusters
– Standard model of complexity.

Doesn’t capture the fact that clusters have different sizes.

Take 2: complexity of a clustering = number of bits needed to describe it.

Writing down “k” needs log k bits. In general, let cluster t ∈ T have |t| elements.
– Set p(t) = |t|/n
– #bits to write down cluster sizes = H(T) = ∑t pt log 1/pt

[Figure: two clusterings compared by entropy, H( ) < H( )]
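The second measure is just the entropy of the cluster-size distribution. A minimal sketch, using the hard clustering of the eight Customer_Id values shown earlier (five values in t1, two in t2, one in t3); the function name and the two comparison clusterings are hypothetical.

```python
import math
from collections import Counter


def clustering_complexity(labels):
    """H(T) = sum_t p(t) * log2(1/p(t)), with p(t) = |t| / n."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


# Cluster_Id column of the hard clustering example.
hard = ["t1", "t1", "t1", "t1", "t2", "t1", "t3", "t2"]

# Hypothetical two-cluster assignments of eight items, for comparison.
balanced = ["t1"] * 4 + ["t2"] * 4
skewed = ["t1"] * 7 + ["t2"]

print(round(clustering_complexity(hard), 3))
print(clustering_complexity(balanced), round(clustering_complexity(skewed), 3))
```

Both comparison clusterings use two clusters, yet the balanced one costs a full bit while the skewed one costs about half a bit, which is the distinction that counting clusters misses.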

Heterogeneity Identification [DKOSV06]

Soft clustering: cluster membership probabilities

How to compute a good soft clustering?

X = Customer_Id T = Cluster_Id p(T|X)

789-15-4321 t1 0.75

987-65-4321 t1 0.75

789-15-4321 t2 0.25

987-65-4321 t2 0.25

(908)-555-1234 t1 0.25

973-360-0000 t1 0.5

(908)-555-1234 t2 0.75

973-360-0000 t2 0.5



Take 1:
– p(t) = ∑x p(x) p(t|x)
– Compute H(T) as before.

Problem:

H(T1) = H(T2) !!

T1 t1 t2 T2 t1 t2

x1 0.5 0.5 x1 0.99 0.01

x2 0.5 0.5 x2 0.01 0.99

p(T) 0.5 0.5 p(T) 0.5 0.5



By averaging the memberships, we’ve lost useful information.

Take 2: Compute I(T;X)!

Even better: If T is a hard clustering of X, then I(T;X) = H(T)

X T1 p(X,T) i(X;T)

x1 t1 0.25 0

x1 t2 0.25 0

x2 t1 0.25 0

x2 t2 0.25 0

I(T1;X) = 0

X T2 p(X,T) i(X;T)

x1 t1 0.495 0.99

x1 t2 0.005 -5.64

x2 t1 0.005 -5.64

x2 t2 0.495 0.99

I(T2;X) = 0.92
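The difference between the two soft clusterings shows up as soon as I(T;X) is computed from the membership probabilities p(t|x). A sketch assuming p(x) uniform over x1 and x2, as in the example; the function name and the nested-dictionary layout are illustrative. Summed over all four (x, t) pairs, T2 comes out near 0.92 bits.

```python
import math
from collections import defaultdict


def soft_clustering_information(p_x, p_t_given_x):
    """I(T;X) = sum_{x,t} p(x) p(t|x) log2( p(t|x) / p(t) ), with p(t) = sum_x p(x) p(t|x)."""
    p_t = defaultdict(float)
    for x, row in p_t_given_x.items():
        for t, p in row.items():
            p_t[t] += p_x[x] * p
    total = 0.0
    for x, row in p_t_given_x.items():
        for t, p in row.items():
            if p > 0:
                total += p_x[x] * p * math.log2(p / p_t[t])
    return total


p_x = {"x1": 0.5, "x2": 0.5}
T1 = {"x1": {"t1": 0.5, "t2": 0.5}, "x2": {"t1": 0.5, "t2": 0.5}}
T2 = {"x1": {"t1": 0.99, "t2": 0.01}, "x2": {"t1": 0.01, "t2": 0.99}}
# A hard clustering as the limiting case: each x belongs to exactly one cluster.
T_hard = {"x1": {"t1": 1.0, "t2": 0.0}, "x2": {"t1": 0.0, "t2": 1.0}}

for name, T in (("T1", T1), ("T2", T2), ("hard", T_hard)):
    print(name, round(soft_clustering_information(p_x, T), 3))
```

All three clusterings have the same marginal p(T) = (0.5, 0.5), so H(T) cannot separate them, whereas I(T;X) ranges from 0 for T1 up to H(T) = 1 for the hard clustering.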


Measuring cost of a cluster

Given objects Xt = {X1, X2, …, Xm} in cluster t,

Cost(t) = sum of distances from Xi to cluster center

If we set distance to be KL-distance, then

Cost of a cluster = I(Xt,V)

Cost of a clustering

If we partition X into k clusters X1, ..., Xk

Cost(clustering) = ∑i pi I(Xi, V) (pi = |Xi|/|X|)

Suppose we treat each cluster center itself as a point ?


Cost of a clustering

We can write down the “cost” of this “cluster”
– Cost(T) = I(T;V)

Key result [BMDG05] : Cost(clustering) = I(X, V) – I(T, V)

Minimizing cost(clustering) => maximizing I(T, V)


Heterogeneity Identification [DKOSV06]

Represent strings as q-gram distributions

X = Customer_Id V = 4-grams p(X,V)

987-65-4321 987- 0.10

987-65-4321 87-6 0.13

987-65-4321 7-65 0.12

987-65-4321 -65- 0.15

987-65-4321 65-4 0.05

987-65-4321 5-43 0.20

987-65-4321 -432 0.15

987-65-4321 4321 0.10

Customer_Id

187-65-2468

987-64-6837

789-15-4321

987-65-4321

(908)-555-1234

973-360-0000

360-0007

(877)-807-4596
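A minimal sketch of turning a string into a q-gram distribution follows. It assumes plain relative frequencies of overlapping q-grams; the weights in the slide's table are not uniform, so [DKOSV06] evidently uses a different weighting that this sketch does not reproduce. The function name is illustrative.

```python
from collections import Counter

def qgram_distribution(s, q=4):
    """Relative frequency of each overlapping q-gram of s (assumed weighting)."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:                     # string shorter than q: no q-grams
        return {}
    return {g: c / total for g, c in counts.items()}

dist = qgram_distribution("987-65-4321")
for gram, p in dist.items():
    print(gram, p)
```

For the string above this yields the same eight 4-grams that appear in the table, each with weight 1/8 under the uniform-frequency assumption.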


Heterogeneity Identification [DKOSV06]

iIB: find soft clustering T of X that minimizes I(T;X) – β I(T;V)

Allow iIB to use arbitrarily many clusters, use β* = H(X)/I(X;V)
– Closest to point with minimum space and maximum quality

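The choice β* = H(X)/I(X;V) is easy to compute once the joint distribution p(X,V) is available. The sketch below assumes the joint is given as a dictionary keyed by (x, v); the toy joint and the function name `beta_star` are illustrative, not data from [DKOSV06].

```python
import math
from collections import defaultdict

def beta_star(joint):
    """beta* = H(X) / I(X;V) for a joint distribution {(x, v): p(x, v)}."""
    p_x, p_v = defaultdict(float), defaultdict(float)
    for (x, v), p in joint.items():
        p_x[x] += p
        p_v[v] += p
    h_x = sum(p * math.log2(1.0 / p) for p in p_x.values() if p > 0)
    i_xv = sum(p * math.log2(p / (p_x[x] * p_v[v]))
               for (x, v), p in joint.items() if p > 0)
    return h_x / i_xv                  # undefined (division by zero) if I(X;V) = 0

# Toy joint: two strings sharing one of their two q-grams.
joint = {("a", "g1"): 0.25, ("a", "g2"): 0.25,
         ("b", "g2"): 0.25, ("b", "g3"): 0.25}
print(beta_star(joint))
```

Since I(X;V) ≤ H(X), the value is always at least 1.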


Heterogeneity Identification [DKOSV06]

Rate distortion curve: I(T;V)/I(X;V) vs I(T;X)/H(X)

[Figure: rate distortion curve with the point β* marked]


Heterogeneity Identification [DKOSV06]

Heterogeneity = mutual information I(T;X) of iIB clustering T at β*

0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)
– Ideally use iIB with an arbitrarily large number of clusters in T

X = Customer_Id T = Cluster_Id p(T|X) i(T;X)

789-15-4321 t1 0.75 0.41

987-65-4321 t1 0.75 0.41

789-15-4321 t2 0.25 -0.81

987-65-4321 t2 0.25 -0.81

(908)-555-1234 t1 0.25 -1.17

973-360-0000 t1 0.5 -0.17

(908)-555-1234 t2 0.75 0.77

973-360-0000 t2 0.5 0.19
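The i(T;X) column can be recomputed from the p(T|X) column. The sketch below assumes a uniform p(x) over the four customer ids (consistent with H(X) = 2.0) and uses i(t;x) = log2(p(t|x)/p(t)); variable names are illustrative.

```python
import math

p_t_given_x = {
    "789-15-4321": {"t1": 0.75, "t2": 0.25},
    "987-65-4321": {"t1": 0.75, "t2": 0.25},
    "(908)-555-1234": {"t1": 0.25, "t2": 0.75},
    "973-360-0000": {"t1": 0.5, "t2": 0.5},
}
# Assumption: each of the four values is equally likely.
p_x = {x: 1.0 / len(p_t_given_x) for x in p_t_given_x}

# Marginal p(t) = sum_x p(x) p(t|x).
p_t = {}
for x, row in p_t_given_x.items():
    for t, p in row.items():
        p_t[t] = p_t.get(t, 0.0) + p_x[x] * p

# Pointwise values i(t;x) and their expectation I(T;X).
pointwise = {(x, t): math.log2(p / p_t[t])
             for x, row in p_t_given_x.items() for t, p in row.items()}
i_tx = sum(p_x[x] * p_t_given_x[x][t] * v for (x, t), v in pointwise.items())
h_t = sum(p * math.log2(1.0 / p) for p in p_t.values())
h_x = sum(p * math.log2(1.0 / p) for p in p_x.values())
print(i_tx, h_t, h_x)
```

Under these assumptions the pointwise values match the table to two decimals, and I(T;X) comes out at about 0.13, in line with the value quoted on the slide and well below both H(T) and H(X).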




Data Integration: Summary

Analyzing database instance critical for effective data integration
– Matching and quality assessments are key components

Information theoretic measures useful for schema matching
– Align columns when column names, data values are opaque
– Mutual information I(X;V) captures correlations between X, V

Information theoretic measures useful for heterogeneity testing
– Identify columns with semantically heterogeneous values
– I(T;X) of iIB clustering T at β* captures column heterogeneity


Outline




Domain size matters

For random variable X, domain size = |supp(X)|, where supp(X) = {xi | p(X = xi) > 0}

Different solutions exist depending on whether domain size is “small” or “large”

Probability vectors usually very sparse


Entropy: Case I - Small domain size

Suppose the number of unique values for a random variable X is small (i.e., fits in memory)

Maximum likelihood estimator:
– p(x) = (number of times x is encountered) / (total number of items in the set)

[Figure: histogram of counts for values 1 to 5]


Entropy: Case I - Small domain size

H_MLE = Σ_x p(x) log 1/p(x)

This is a biased estimate:
– E[H_MLE] < H

Miller-Madow correction:
– H’ = H_MLE + (m’ – 1)/(2n)
– m’ is an estimate of the number of non-empty bins
– n = number of samples

Bad news: ALL estimators for H are biased.

Good news: we can quantify bias and variance of MLE (m = number of bins):
– |Bias| <= log(1 + m/n)
– Var(H_MLE) <= (log n)^2/n
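A minimal sketch of the plug-in (MLE) entropy estimate and the Miller-Madow correction follows. It assumes natural logarithms, since the (m’ – 1)/(2n) correction term is stated in nats, and takes m’ to be the number of distinct values observed. The function names and sample data are illustrative.

```python
import math
from collections import Counter

def entropy_mle(samples):
    """Plug-in entropy estimate in nats, using p(x) = count(x) / n."""
    n = len(samples)
    return sum((c / n) * math.log(n / c) for c in Counter(samples).values())

def entropy_miller_madow(samples):
    """MLE estimate plus the (m' - 1) / (2n) bias correction, in nats."""
    m_nonempty = len(set(samples))     # assumed estimate of non-empty bins
    return entropy_mle(samples) + (m_nonempty - 1) / (2 * len(samples))

samples = [1, 1, 1, 1, 2, 2, 3, 4]     # hypothetical small sample
print(entropy_mle(samples), entropy_miller_madow(samples))
```

Because the MLE underestimates H on average, the correction is always non-negative; it shrinks as n grows relative to the number of occupied bins.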


Entropy: Case II - Large domain size

|X| is too large to fit in main memory, so we can’t maintain explicit counts.

Streaming algorithms for H(X):
– Long history of work on this problem
– Bottom line: (1+ε)-relative-approximation for H(X) that allows for updates to frequencies, and requires “almost constant”, and optimal, space [HNO08].


Streaming Entropy [CCM07]

High level idea: sample randomly from the stream, and track counts of elements picked [AMS]

PROBLEM: skewed distribution prevents us from sampling lower-frequency elements (and entropy is small)

Idea: estimate largest frequency, and distribution of what’s left (higher entropy)
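To make the sampling idea concrete, here is a sketch of the basic estimator as I understand it: pick a position j in the stream, let r be the number of occurrences of that element from position j onward, and output g(r) – g(r – 1) with g(r) = r log(m/r), where m is the stream length. Averaged over a uniformly random j this equals H exactly. The code is written offline over a list for clarity, not as a one-pass algorithm, and it is not the full [CCM07] method; names are illustrative.

```python
import math
from collections import Counter

def basic_estimate(stream, j):
    """Single-sample estimate of H (bits) from position j of the stream."""
    m = len(stream)
    r = sum(1 for a in stream[j:] if a == stream[j])   # occurrences from j onward

    def g(c):
        return c * math.log2(m / c) if c > 0 else 0.0

    return g(r) - g(r - 1)

def exact_entropy(stream):
    """Empirical entropy in bits, computed from full counts."""
    m = len(stream)
    return sum((c / m) * math.log2(m / c) for c in Counter(stream).values())

stream = list("aaaaaaabbbcd")          # hypothetical skewed stream
# Averaging over every position gives the expectation under a uniform choice of j.
mean = sum(basic_estimate(stream, j) for j in range(len(stream))) / len(stream)
print(mean, exact_entropy(stream))
```

The estimator is unbiased, but on skewed streams most samples land on the dominant element, which is the variance problem the slide points to.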


Streaming Entropy [CCM07]

Maintain set of samples from original distribution and distribution without most frequent element.

In parallel, maintain estimator for frequency of most frequent element
– normally this is hard
– but if frequency is very large, then simple estimator exists [MG81] (Google interview puzzle!)

At the end, compute function of these two estimates

Memory usage: roughly 1/ε^2 log(1/ε) (ε is the error)
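The “simple estimator” for a very frequent element can be sketched with a Misra-Gries style frequent-items summary, which I take to be what [MG81] refers to. With at most k – 1 counters over a stream of length n, each reported count undercounts the true frequency by at most n/k. This is only the heavy-element component, not the full [CCM07] estimator; the stream below is hypothetical.

```python
def misra_gries(stream, k):
    """Frequent-items summary with at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop exhausted counters.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 60 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5
summary = misra_gries(stream, k=4)
heaviest = max(summary, key=summary.get)
print(heaviest, summary[heaviest])
```

The estimate is only informative when the top frequency is well above n/k, which matches the slide's remark that the easy case is the one where the frequency is very large.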


Entropy and MI are related

I(X;Y) = H(X) + H(Y) – H(X,Y)

Suppose we can c-approximate H(X) for any c > 0: find H’(X) s.t. |H(X) – H’(X)| <= c

Then we can 3c-approximate I(X;Y):
– I(X;Y) = H(X) + H(Y) – H(X,Y)
  <= (H’(X) + c) + (H’(Y) + c) – (H’(X,Y) – c)
  = H’(X) + H’(Y) – H’(X,Y) + 3c
  = I’(X;Y) + 3c
– The lower bound I(X;Y) >= I’(X;Y) – 3c follows symmetrically

Similarly, we can 2c-approximate H(Y|X) = H(X,Y) – H(X)

Estimating entropy allows us to estimate I(X;Y) and H(Y|X)
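A short sketch of how an entropy routine yields the other two quantities, using the standard identities I(X;Y) = H(X) + H(Y) – H(X,Y) and H(Y|X) = H(X,Y) – H(X). Here the entropy routine is the plug-in estimate on a list of samples; any other estimator could be swapped in. Function names and sample data are illustrative.

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in entropy in bits of a list of hashable observations."""
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in Counter(samples).values())

def mutual_information(pairs):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from a list of (x, y) observations."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)

def conditional_entropy(pairs):
    """H(Y|X) = H(X,Y) - H(X)."""
    return entropy(pairs) - entropy([x for x, _ in pairs])

dependent = [(0, 0), (1, 1)] * 4                 # Y is a copy of X
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]   # all combinations equally often
print(mutual_information(dependent), conditional_entropy(dependent))
print(mutual_information(independent), conditional_entropy(independent))
```

Three entropy calls feed the mutual information and two feed the conditional entropy, which is where the 3c and 2c error bounds come from.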


Computing KL-divergence: Small Domains

“easy algorithm”: maintain counts for each of p and q, normalize, and compute KL-divergence.

PROBLEM! Suppose qi = 0 while pi > 0:
– pi log pi/qi is undefined!

General problem with ML estimators: all events not seen have probability zero!!
– Laplace correction: add one to the count of every element in the domain, seen or not
– Slightly better: add 0.5 to every count instead [KT81]
– Even better, more involved: use Good-Turing estimator [GT53]

Yield non-zero probability for “things not seen”.
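A minimal sketch of the additive corrections, assuming the domain is known in advance and the constant is added to the count of every domain element (alpha = 1 for Laplace, alpha = 0.5 for the [KT81] variant). Good-Turing is not shown. The counts and function names are illustrative.

```python
import math

def smoothed_distribution(counts, domain, alpha=1.0):
    """Add alpha to every domain element's count, then normalize."""
    total = sum(counts.get(v, 0) for v in domain) + alpha * len(domain)
    return {v: (counts.get(v, 0) + alpha) / total for v in domain}

def kl_divergence(p, q):
    """d_KL(p || q) in bits; q must be positive wherever p is."""
    return sum(p[v] * math.log2(p[v] / q[v]) for v in p if p[v] > 0)

domain = ["a", "b"]
p_counts = {"a": 3, "b": 1}
q_counts = {"a": 4}                    # "b" never seen: the raw MLE gives q(b) = 0

p = smoothed_distribution(p_counts, domain, alpha=1.0)
q = smoothed_distribution(q_counts, domain, alpha=1.0)
print(kl_divergence(p, q))
```

Without smoothing, the term for "b" would divide by zero; with it, the divergence is finite.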


Computing KL-divergence: Large Domains

Bad news: No good relative-approximations exist in small space.

(Partial) good news: additive approximations in small space under certain technical conditions (no pi is too small).

(Partial) good news: additive approximations for symmetric variant of KL-divergence, via sampling.

For details, see [GMV08,GIM08]


Information-theoretic Clustering

Given a collection of random variables X, each “explained” by a random variable Y, we wish to find a (hard or soft) clustering T such that

I(T;X) – β I(T;Y) is minimized.

Features of solutions thus far:
– heuristic (general problem is NP-hard)
– address both small-domain and large-domain scenarios.


Agglomerative Clustering (aIB) [ST00]

Fix number of clusters k
1. While number of clusters > k
   1. Determine two clusters whose merge loses the least information
   2. Combine these two clusters
2. Output clustering

Merge criterion:
– merge the two clusters so that change in I(T;V) is minimized

Note: no consideration of β (number of clusters is fixed)


Agglomerative Clustering (aIB) [S]

Elegant way of finding the two clusters to be merged:

Let dJS(p,q) = (1/2)(dKL(p,m) + dKL(q,m)), m = (p+q)/2

dJS(p,q) is a symmetric distance between p, q (Jensen-Shannon distance)

We merge clusters that have smallest dJS(p,q), (weighted by cluster mass)

[Figure: distributions p and q with their midpoint m]
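The merge loop can be sketched as follows. The merge cost used here is the mass-weighted Jensen-Shannon form, (w1 + w2) times the JS divergence with mixing weights proportional to cluster mass, which is my reading of “weighted by cluster mass”; it equals the drop in I(T;V) caused by the merge. The data and function names are hypothetical, and the quadratic pair search is kept naive for clarity.

```python
import math

def kl(p, q):
    """d_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def merge_cost(w1, c1, w2, c2):
    """Information lost by merging clusters with masses w1, w2 and centers c1, c2."""
    w = w1 + w2
    m = [(w1 * a + w2 * b) / w for a, b in zip(c1, c2)]   # merged center
    return w1 * kl(c1, m) + w2 * kl(c2, m)

def agglomerate(weights, centers, k):
    """Greedy merging until k clusters remain; returns (mass, center, members)."""
    clusters = [(w, list(c), [i]) for i, (w, c) in enumerate(zip(weights, centers))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i][0], clusters[i][1],
                                  clusters[j][0], clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (w1, c1, m1), (w2, c2, m2) = clusters[i], clusters[j]
        w = w1 + w2
        merged = (w, [(w1 * a + w2 * b) / w for a, b in zip(c1, c2)], m1 + m2)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

p_x = [0.25, 0.25, 0.25, 0.25]
p_v_given_x = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]
result = agglomerate(p_x, p_v_given_x, k=2)
print(sorted(sorted(members) for _, _, members in result))
```

With equal masses the cost reduces to a multiple of the unweighted dJS(p,q) defined above.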


Iterative Information Bottleneck (iIB) [S]

aIB yields a hard clustering with k clusters. If you want a soft clustering, use iIB (variant of EM)

– Step 1: p(t|x) ∝ exp(–β dKL(p(V|x), p(V|t)))
  assign elements to clusters in proportion (exponentially) to distance from cluster center!
– Step 2: Compute new cluster centers by computing weighted centroids:
  p(t) = Σ_x p(t|x) p(x)
  p(V|t) = Σ_x p(V|x) p(t|x) p(x)/p(t)
– Choose β according to [DKOSV06]
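The two steps can be sketched as a simple alternating loop. Assumptions: KL-distance is measured in nats, β is supplied by the caller, the assignment step includes the prior factor p(t) from the usual information bottleneck update (an addition relative to Step 1 as written on the slide), and every cluster keeps positive mass. The data, initial assignment and function names are hypothetical.

```python
import math

def kl_nats(p, q):
    """d_KL(p || q) in nats; infinite if q is zero where p is positive."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

def iib(p_x, p_v_given_x, init, beta, iterations=50):
    """Alternate centroid and soft-assignment updates; returns p(t|x)."""
    n, m, k = len(p_x), len(p_v_given_x[0]), len(init[0])
    q = [row[:] for row in init]       # q[i][t] = p(t | x_i)
    for _ in range(iterations):
        # Step 2: cluster masses and weighted centroids (assumes p_t[t] > 0).
        p_t = [sum(p_x[i] * q[i][t] for i in range(n)) for t in range(k)]
        centers = [[sum(p_x[i] * q[i][t] * p_v_given_x[i][j] for i in range(n)) / p_t[t]
                    for j in range(m)] for t in range(k)]
        # Step 1: soft assignment, exponentially decreasing in KL-distance.
        for i in range(n):
            d = [kl_nats(p_v_given_x[i], centers[t]) for t in range(k)]
            d_min = min(d)             # shift for numerical stability
            scores = [p_t[t] * math.exp(-beta * (d[t] - d_min)) for t in range(k)]
            z = sum(scores)
            q[i] = [s / z for s in scores]
    return q

p_x = [0.25, 0.25, 0.25, 0.25]
p_v_given_x = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]
init = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]   # slightly asymmetric start
soft = iib(p_x, p_v_given_x, init, beta=20.0)
for row in soft:
    print([round(v, 3) for v in row])
```

Larger β pushes the assignments toward a hard clustering, while β near zero leaves every object spread over all clusters.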


Dealing with massive data sets

Clustering on massive data sets is a problem

Two main heuristics:
– Sampling [DKOSV06]: pick a small sample of the data, cluster it, and (if necessary) assign remaining points to clusters using soft assignment. How many points to sample to get good bounds?
– Streaming: scan the data in one pass, performing clustering on the fly. How much memory is needed to get a reasonable-quality solution?


LIMBO (for aIB) [ATMS04]

BIRCH-like idea:
– Maintain (sparse) summary for each cluster (p(t), p(V|t))
– As data streams in, build clusters on groups of objects
– Build next-level clusters on cluster summaries from lower level


Outline




Open Problems

Data exploration and mining – information theory as first-pass filter

Relation to nonparametric generative models in machine learning (LDA, PPCA, ...)

Engineering and stability: finding right knobs to make systems reliable and scalable

Other information-theoretic concepts ? (rate distortion, higher-order entropy, ...)

THANK YOU!

References: Information Theory

[CT] Tom Cover and Joy Thomas: Information Theory.

[BMDG05] Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, Joydeep Ghosh. Clustering with Bregman Divergences, JMLR 2005.

[TPB98] Naftali Tishby, Fernando Pereira, William Bialek. The Information Bottleneck Method. Proc. 37th Annual Allerton Conference, 1998.


References: Data Anonymization

[AA01] Dakshi Agrawal, Charu C. Aggarwal: On the design and quantification of privacy preserving data mining algorithms. PODS 2001.

[AS00] Rakesh Agrawal, Ramakrishnan Srikant: Privacy preserving data mining. SIGMOD 2000.

[EGS03] Alexandre Evfimievski, Johannes Gehrke, Ramakrishnan Srikant: Limiting privacy breaches in privacy preserving data mining. PODS 2003.


References: Database Design

[AL03] Marcelo Arenas, Leonid Libkin: An information theoretic approach to normal forms for relational and XML data. PODS 2003.

[AL05] Marcelo Arenas, Leonid Libkin: An information theoretic approach to normal forms for relational and XML data. JACM 52(2), 246-283, 2005.

[DR00] Mehmet M. Dalkilic, Edward L. Robertson: Information dependencies. PODS 2000.

[KL06] Solmaz Kolahi, Leonid Libkin: On redundancy vs dependency preservation in normalization: an information-theoretic study of XML. PODS 2006.


References: Data Integration

[AMT04] Periklis Andritsos, Renee J. Miller, Panayiotis Tsaparas: Information-theoretic tools for mining database structure from large data sets. SIGMOD 2004.

[DKOSV06] Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, Suresh Venkatasubramanian: Rapid identification of column heterogeneity. ICDM 2006.

[DKSTV08] Bing Tian Dai, Nick Koudas, Divesh Srivastava, Anthony K. H. Tung, Suresh Venkatasubramanian: Validating multi-column schema matchings by type. ICDE 2008.

[KN03] Jaewoo Kang, Jeffrey F. Naughton: On schema matching with opaque column names and data values. SIGMOD 2003.

[PPH05] Patrick Pantel, Andrew Philpot, Eduard Hovy: An information theoretic model for database alignment. SSDBM 2005.


References: Computing IT quantities

[P03] Liam Paninski. Estimation of entropy and mutual information. Neural Computation 15: 1191-1254, 2003.

[GT53] I. J. Good. Turing’s anticipation of Empirical Bayes in connection with the cryptanalysis of the Naval Enigma. Journal of Statistical Computation and Simulation, 66(2), 2000.

[KT81] R. E. Krichevsky and V. K. Trofimov. The performance of universal encoding. IEEE Trans. Inform. Th. 27 (1981), 199--207.

[CCM07] Amit Chakrabarti, Graham Cormode and Andrew McGregor. A near-optimal algorithm for computing the entropy of a stream. Proc. SODA 2007.


References: Computing IT quantities

[HNO08] Nick Harvey, Jelani Nelson, Krzysztof Onak. Sketching and Streaming Entropy via Approximation Theory. FOCS 2008.

[ATMS04] Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. EDBT 2004.

[S] Noam Slonim. The Information Bottleneck: theory and applications. Ph.D Thesis. Hebrew University, 2000.

[GMV08] Sudipto Guha, Andrew McGregor, Suresh Venkatasubramanian. Streaming and sublinear approximations for information distances. ACM Trans Alg. 2008.

[GIM08] Sudipto Guha, Piotr Indyk, Andrew McGregor. Sketching Information Distances. JMLR, 2008.

