Page 1

Aggregate features for relational data

Claudia Perlich, Foster Provost

Pat Tressel, 16-May-2005

Page 2: Overview

• Perlich and Provost provide:
  – A hierarchy of aggregation methods
  – A survey of existing aggregation methods
  – New aggregation methods
• Concerned w/ supervised learning only
  – But much seems applicable to clustering

Page 3: The issues…

• Most classifiers use feature vectors
  – Individual features have fixed arity
  – No links to other objects
• How do we get feature vectors from relational data?
  – Flatten it:
    • Joins
    • Aggregation
• (Are feature vectors all there are?)

Page 4: Joins

• Why consider them?
  – Yield flat feature vectors
  – Preserve all the data
• Why not use them?
  – They emphasize data with many references
    • OK if that’s what we want
    • Not OK if sampling was skewed
    • Cascaded or transitive joins blow up

Page 5: Joins

• They emphasize data with many references:
  – Lots more Joes than there were before...

Page 6: Joins

• Why not use them?
  – What if we don’t know the references?
    • Try out everything with everything else
    • Cross product yields all combinations
    • Adds fictitious relationships
    • Combinatorial blowup

Page 7: Joins

• What if we don’t know the references?

Page 8: Aggregates

• Why use them?
  – Yield flat feature vectors
  – No blowup in number of tuples
    • Can group tuples in all related tables
  – Can keep as detailed stats as desired
    • Not just max, mean, etc.
    • Parametric dists from sufficient stats
    • Can apply tests for grouping
  – Choice of aggregates can be model-based
    • Better generalization
    • Include domain knowledge in model choice

Page 9: Aggregates

• Anything wrong with them?
  – Data is lost
  – Relational structure is lost
  – Influential individuals are lumped in
    • Doesn’t discover critical individuals
    • Dominates other data
  – Any choice of aggregates assumes a model
    • What if it’s wrong?
  – Adding new data can require recomputing aggregates
    • But can avoid the issue by keeping sufficient statistics (see the sketch below)
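The last point can be made concrete. Below is a minimal sketch (not from the paper or these slides) of keeping per-group sufficient statistics (count, sum, sum of squares) so that aggregates such as mean and variance can be updated incrementally as new tuples arrive; the class and the data are hypothetical.

```python
from collections import defaultdict

class RunningStats:
    """Sufficient statistics (n, sum, sum of squares) for one group."""
    def __init__(self):
        self.n = 0
        self.s = 0.0
        self.s2 = 0.0

    def add(self, x):
        # Updating the sufficient statistics is O(1) per new tuple.
        self.n += 1
        self.s += x
        self.s2 += x * x

    def mean(self):
        return self.s / self.n if self.n else 0.0

    def variance(self):
        if self.n < 2:
            return 0.0
        return (self.s2 - self.s * self.s / self.n) / (self.n - 1)

# One group of statistics per target-table key (hypothetical data).
stats = defaultdict(RunningStats)
for key, amount in [("joe", 10.0), ("joe", 30.0), ("ann", 5.0)]:
    stats[key].add(amount)

print(stats["joe"].mean(), stats["joe"].variance())   # 20.0 200.0
```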

Page 10: Taxonomy of aggregates

• Why is this useful?
  – Promote deliberate use of aggregates
  – Point out gaps in current use of aggregates
  – Find appropriate techniques for each class
• Based on “complexity” due to:
  – Relational structure
    • Cardinality of the relations (1:1, 1:n, m:n)
  – Feature extraction
    • Computing the aggregates
  – Class prediction

Page 11: Taxonomy of aggregates

• Formal statement of the task:

  y = Φ(t, Ψ(σ(t, Ω)))

• Notation (here and on following slides):
  – t, tuple (from “target” table T, with main features)
  – y, class (known per t if training)
  – Ψ, aggregation function
  – Φ, classification function
  – σ, select operation (where joins preserve t)
  – Ω, all tables; B, any other table, b in B
  – u, fields to be added to t from other tables
  – f, a field in u
  – More, that doesn’t fit on this slide

Page 12: Taxonomy of aggregates

• Formal statement of the task:

  y = Φ(t, Ψ(σ(t, Ω)))

• Notation (here and on following slides):
  – Caution! Simplified from what’s in the paper!
  – t, tuple (from “target” table T, with main features)
  – y, class (known per t if training)
  – Ψ, aggregation function
  – Φ, classification function
  – σ, select operation (where joins preserve t)
  – Ω, all tables; B, any other table, b a tuple in B
  – u, fields to be added to t from joined tables
  – f, a field in u
  – More, that doesn’t fit on this slide

Page 13: Aggregation complexity

• Simple
  – One field from one object type
• Denoted by: s

Page 14: Aggregation complexity

• Multi-dimensional
  – Multiple fields, one object type
• Denoted by: m

Page 15: Aggregation complexity

• Multi-type
  – Multiple object types
• Denoted by: t

Page 16: Relational “concept” complexity

• Propositional
  – No aggregation
  – Single tuple, 1:1 or n:1 joins
    • n:1 is just a shared object
  – Not relational per se – already flat

  y = Φ(t)
  y = Φ(t, σ_{t.key = b.key}(T × B.*))   (1:1)
  y = Φ(t, σ_{t.key = b.key}(T × B.*))   (n:1)

Page 17: Relational “concept” complexity

• Independent fields
  – Separate aggregation per field
  – Separate 1:n joins with T

  y = Φ(t, Ψ^s(σ_{t.key = b.key}(T × B.f_i)))   (1:n)

Page 18: Relational “concept” complexity

• Dependent fields in same table
  – Multi-dimensional aggregation
  – Separate 1:n joins with T

  y = Φ(t, Ψ^m(σ_{t.key = b.key}(T × B.*)))   (1:n)

Page 19: Relational “concept” complexity

• Dependent fields over multiple tables
  – Multi-type aggregation
  – Separate 1:n joins, still only with T

  y = Φ(t, Ψ^t(σ_{t.key = a.key}(T × A.*), σ_{t.key = b.key}(T × B.*)))   (both 1:n)

Page 20: Relational “concept” complexity

• Global
  – Any joins or combinations of fields
    • Multi-type aggregation
    • Multi-way joins
    • Joins among tables other than T

  y = Φ(t, Ψ*(σ(t, Ω)))

Page 21: Current relational aggregation

• First-order logic
  – Find clauses that directly predict the class
    • Φ is OR
  – Form binary features from tests
    • Logical and arithmetic tests
    • These go in the feature vector
    • Φ is any ordinary classifier

Page 22: Current relational aggregation

• The usual database aggregates (sketched below)
  – For numerical values:
    • mean, min, max, count, sum, etc.
  – For categorical values:
    • Most common value
    • Count per value
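A minimal pandas sketch of these standard aggregates, flattening a hypothetical 1:n detail table into one feature row per target key; the table and column names are made up for illustration and are not from the paper.

```python
import pandas as pd

# Hypothetical 1:n detail table keyed by the target table's key.
detail = pd.DataFrame({
    "t_key":    [1, 1, 1, 2, 2],
    "amount":   [10.0, 30.0, 20.0, 5.0, 7.0],   # numerical field
    "category": ["a", "b", "a", "c", "c"],      # categorical field
})

# Numerical field: mean, min, max, count, sum per target tuple.
num_aggs = detail.groupby("t_key")["amount"].agg(["mean", "min", "max", "count", "sum"])

# Categorical field: most common value and count per value.
mode = detail.groupby("t_key")["category"].agg(lambda s: s.mode().iloc[0]).rename("most_common")
counts = detail.groupby("t_key")["category"].value_counts().unstack(fill_value=0)

flat = num_aggs.join(mode).join(counts)   # one flat feature row per t_key
print(flat)
```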

Page 23: Current relational aggregation

• Set distance
  – Two tuples, each with a set of related tuples
  – Distance metric between related fields
    • Euclidean for numerical data
    • Edit distance for categorical
  – Distance between sets is the distance of the closest pair (sketched below)
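A small sketch of the closest-pair set distance, assuming the related tuples are numeric vectors compared with Euclidean distance; the function and variable names are illustrative only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def set_distance(set_a, set_b, metric=euclidean):
    """Distance between two sets of related tuples = distance of the closest pair."""
    return min(metric(u, v) for u in set_a for v in set_b)

# Related tuples for two target cases (hypothetical numeric fields).
related_1 = [(1.0, 2.0), (4.0, 0.0)]
related_2 = [(1.5, 2.5), (9.0, 9.0)]
print(set_distance(related_1, related_2))   # ~0.707: closest pair is (1, 2) vs (1.5, 2.5)
```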

Page 24: Proposed relational aggregation

• Recall the point of this work:
  – Tuple t from table T is part of a feature vector
  – Want to augment w/ info from other tables
  – Info added to t must be consistent w/ values in t
  – Need to flatten the added info to yield one vector per tuple t
  – Use that to:
    • Train a classifier given class y for t
    • Predict class y for t

Page 25: Proposed relational aggregation

• Outline of steps:
  – Do a query to get more info u from other tables
  – Partition the results based on:
    • Main features t
    • Class y
    • Predicates on t
  – Extract distributions over results for fields in u
    • Get a distribution for each partition
    • For now, limit to categorical fields
    • Suggest extension to numerical fields
  – Derive features from distributions

Page 26: Do query to get info from other tables

• Select
  – Based on the target table T
  – If training, known class y is included in T
  – Joins must preserve distinct values from T
    • Join on as much of T’s key as is present in the other table
    • Maybe need to constrain other fields?
    • Not a problem for correctly normalized tables
• Project
  – Include all of t
  – Append additional fields u from joined tables
    • Anything up to all fields from joins (see the sketch below)
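A pandas sketch of this select/project step: join a hypothetical other table to T on T's key, keep all of t, and append the extra fields u. Table and column names are assumptions, not from the paper.

```python
import pandas as pd

# Hypothetical target table T (with class y when training) and one other table B.
T = pd.DataFrame({"t_key": [1, 2], "main_feat": [0.3, 0.9], "y": [1, 0]})
B = pd.DataFrame({"t_key": [1, 1, 2], "u_field": ["a", "b", "c"]})

# A left join on T's key preserves every distinct t even if B has no match,
# keeps all of t, and appends the extra fields u from B.
joined = T.merge(B, on="t_key", how="left")
print(joined)   # one row per (t, related b) pair, ready to be partitioned per t
```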

Page 27: Extract distributions

• Partition query results various ways, e.g.:
  – Into cases per each t
    • For training, include the (known) class y in t
  – Also (if training) split per each class
    • Want this for class priors
  – Split per some (unspecified) predicate c(t)
• For each partition:
  – There is a bag of associated u tuples
    • Ignore the t part – already a flat vector
  – Split vertically to get bags of individual values per each field f in u
    • Note this breaks association between fields!

Page 28: Distributions for categorical fields

• Let the categorical field be f with values f_i
• Form a histogram for each partition
  – Count instances of each value f_i of f in a bag
  – These are sufficient statistics for:
    • Distribution over f_i values
    • Probability of each bag in the partition
• Start with one per each tuple t and field f
  – C^f_t, (per-)case vector
  – Component C^f_t[i], count for f_i (sketched below)
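A sketch of building the per-case vectors C^f_t, one histogram of the categorical field per target tuple t, from the joined rows; the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical joined result: one row per (target tuple, related tuple).
joined = pd.DataFrame({
    "t_key":   [1, 1, 1, 2, 2],
    "u_field": ["a", "b", "a", "c", "a"],   # categorical field f
})

# Case vectors C^f_t: one histogram row per target tuple t,
# one column per value f_i, entries are counts of f_i in t's bag.
case_vectors = (joined.groupby("t_key")["u_field"]
                      .value_counts()
                      .unstack(fill_value=0))
print(case_vectors)
#        a  b  c
# t_key
# 1      2  1  0
# 2      1  0  1
```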

Page 29: Distributions for categorical fields

• Distribution of histograms per predicate c(t) and field f
  – Treat histogram counts as random variables
    • Regard the c(t)-true partition as a collection of histogram “samples”
    • Regard histograms as vectors of random variables, one per field value f_i
  – Extract moments of these histogram count distributions
    • mean (sort of) – reference vector
    • variance (sort of) – variance vector

Page 30: Distributions for categorical fields

• Net histogram per predicate c(t), field f
  – c(t) partitions tuples t into two groups
    • Only histogram the c(t)-true group
    • Could include ~c as a predicate if we want
  – Don’t re-count!
    • Already have histograms for each t and f – the case reference vectors
    • Sum the case reference vectors columnwise
  – Call this a “reference vector”, R^f_c
    • Proportional to the average histogram over t for c(t) true (weighted by # samples per t)

Page 31: Distributions for categorical fields

• Variance of case histograms per predicate c(t) and field f
  – Define a “variance vector”, V^f_c (sketched below, along with R^f_c)
    • Columnwise sum of squares of case reference vectors / number of samples with c(t) true
    • Not an actual variance
      – Squared means not subtracted
    • Don’t care:
      – It’s indicative of the variance...
      – Throw in means-based features as well to give the classifier full variance info
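A sketch of the reference vector R^f_c (columnwise sum of the case vectors of the c(t)-true tuples) and of the “variance vector” V^f_c as defined on this slide (columnwise sum of squares divided by the number of c(t)-true tuples); the predicate and data are hypothetical.

```python
import pandas as pd

# Case vectors C^f_t from the previous sketch: rows = target tuples, cols = values f_i.
case_vectors = pd.DataFrame({"a": [2, 1], "b": [1, 0], "c": [0, 1]}, index=[1, 2])

# Hypothetical predicate c(t): here, "t has class y == 1", true for t_key 1 only.
c_true = pd.Series({1: True, 2: False})

selected = case_vectors[c_true]                  # case vectors of the c(t)-true tuples
R = selected.sum(axis=0)                         # reference vector: columnwise sum
V = (selected ** 2).sum(axis=0) / len(selected)  # "variance vector" as defined on the slide

print(R.to_dict())   # {'a': 2, 'b': 1, 'c': 0}
print(V.to_dict())   # {'a': 4.0, 'b': 1.0, 'c': 0.0}
```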

Page 32: Distributions for categorical fields

• What predicates might we use?
  – Unconditionally true, c(t) = true
    • Result is a net distribution independent of t
    • Unconditional reference vector, R
  – Per class k, c_k(t) = (t.y == k)
    • Class priors
    • Recall for training data, y is a field in t
    • Per-class reference vector, R^f_{t.y=k}

Page 33: Distributions for categorical fields

• Summary of notation
  – c(t), a predicate based on values in a tuple t
  – f, a categorical field from a join with T
  – f_i, values of f
  – R^f_c, reference vector
    • histogram over f_i values in the bag for c(t) true
  – C^f_t, case vector
    • histogram over f_i values for t’s bag
  – R, unconditional reference vector
  – V^f_c, variance vector
    • columnwise average of the squared case vectors
  – X[i], i-th value in some ref. vector X

Page 34: Distributions for numerical data

• Same general idea – representative distributions per various partitions
• Can use the categorical techniques if we:
  – Bin the numerical values
  – Treat each bin as a categorical value (sketched below)
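A small sketch of that binning step: turn a numeric field into bin labels, then feed the labels to the same categorical machinery (case vectors, reference vectors). The bin edges and names are arbitrary illustrative choices.

```python
import pandas as pd

# Hypothetical numeric field from the joined rows.
joined = pd.DataFrame({
    "t_key":  [1, 1, 2, 2, 2],
    "amount": [3.0, 12.0, 48.0, 7.5, 90.0],
})

# Bin the numeric values into a handful of labelled intervals...
bins = [0, 10, 50, float("inf")]
joined["amount_bin"] = pd.cut(joined["amount"], bins=bins, labels=["low", "mid", "high"])

# ...then treat each bin exactly like a categorical value f_i.
case_vectors = (joined.groupby("t_key")["amount_bin"]
                      .value_counts()
                      .unstack(fill_value=0))
print(case_vectors)
```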

Page 35: Feature extraction

• Base features on ref. and variance vectors
• Two kinds:
  – “Interesting” values
    • one value from the case reference vector per t
    • same column in the vector for all t
    • assorted options for choosing the column
    • choices depend on the predicate ref. vectors
  – Vector distances
    • distance between the case ref. vector and a predicate ref. vector
    • various distance metrics
• More notation: an acronym for each feature type

Page 36: Feature extraction: “interesting” values

• For a given c, f, select that f_i which is... (sketched below)
  – MOC: Most common overall
    • argmax_i R[i]
  – Most common in each class
    • For binary class y
      – Positive is y = 1, Negative is y = 0
    • MOP: argmax_i R^f_{t.y=1}[i]
    • MON: argmax_i R^f_{t.y=0}[i]
  – Most distinctive per class
    • Common in one class but not in the other(s)
    • MOD: argmax_i |R^f_{t.y=1}[i] - R^f_{t.y=0}[i]|
    • MOM: argmax_i of the MOD difference / (V^f_{t.y=1}[i] - V^f_{t.y=0}[i])
      – Normalizes for variance (sort of)
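A sketch of the MOC/MOP/MON/MOD selections over (hypothetical) reference vectors held as NumPy arrays aligned to a common value list; MOM is left out here because its exact normalization is unclear from the slide.

```python
import numpy as np

# Value labels for field f and (hypothetical) reference vectors.
values = np.array(["a", "b", "c"])
R      = np.array([5, 9, 2])   # unconditional reference vector
R_pos  = np.array([1, 7, 2])   # R^f_{t.y=1}, positive-class reference vector
R_neg  = np.array([4, 2, 0])   # R^f_{t.y=0}, negative-class reference vector

moc = values[np.argmax(R)]                     # MOC: most common overall
mop = values[np.argmax(R_pos)]                 # MOP: most common in the positive class
mon = values[np.argmax(R_neg)]                 # MON: most common in the negative class
mod = values[np.argmax(np.abs(R_pos - R_neg))] # MOD: most distinctive between classes

print(moc, mop, mon, mod)   # b b a b
```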

Page 37: Feature extraction: vector distance

• Distance btw a given ref. vector R and each case vector C
• Distance metrics (sketched below)
  – ED: Edit – not defined
    • Sum of abs. diffs, a.k.a. Manhattan dist?
    • Σ_i |C[i] - R[i]|
  – EU: Euclidean
    • √((C - R)ᵀ(C - R)), omit the √ for speed
  – MA: Mahalanobis
    • √((C - R)ᵀ Σ⁻¹ (C - R)), omit the √ for speed
    • Σ should be a covariance... of what?
  – CO: Cosine, 1 - cos(angle btw vectors)
    • 1 - CᵀR / (‖C‖ ‖R‖)
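A sketch of the four distances between a case vector C and a reference vector R. The covariance matrix for Mahalanobis is supplied by the caller, since the slide notes the paper leaves its source unclear; all data here are illustrative.

```python
import numpy as np

def manhattan(c, r):
    return np.sum(np.abs(c - r))                      # ED: sum of absolute differences

def euclidean_sq(c, r):
    d = c - r
    return float(d @ d)                               # EU: squared Euclidean ("omit sqrt for speed")

def mahalanobis_sq(c, r, cov):
    d = c - r
    return float(d @ np.linalg.inv(cov) @ d)          # MA: squared Mahalanobis w/ caller-supplied covariance

def cosine_dist(c, r):
    return 1.0 - float(c @ r) / (np.linalg.norm(c) * np.linalg.norm(r))  # CO: 1 - cosine similarity

C = np.array([2.0, 1.0, 0.0])       # case vector (per-t histogram)
R = np.array([5.0, 9.0, 2.0])       # reference vector
cov = np.diag([1.0, 2.0, 1.0])      # stand-in covariance for the Mahalanobis example

print(manhattan(C, R), euclidean_sq(C, R), mahalanobis_sq(C, R, cov), round(cosine_dist(C, R), 3))
```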

Page 38: Feature extraction: vector distance

• Apply each metric w/ various ref. vectors
  – Acronym is the metric w/ a suffix for the ref. vector
  – (No suffix): Unconditional ref. vector
  – P: per-class positive ref. vector, R^f_{t.y=1}
  – N: per-class negative ref. vector, R^f_{t.y=0}
  – D: difference between the P and N distances
• Alphabet soup, e.g. EUP, MAD, ...

Page 39: Feature extraction

• Other features added for tests
  – Not part of their aggregation proposal
  – AH: “abstraction hierarchy” (?)
    • Pull into T all fields that are just “shared records” via n:1 references
  – AC: “autocorrelation” aggregation
    • For joins back into T, get the other cases “linked to” each t
    • Fraction of positive cases among those others

Page 40: Learning

• Find linked tables
  – Starting from T, do a breadth-first walk of the schema graph
    • Up to some max depth
    • Cap the number of paths followed
  – For each path, know T is linked to the last table in the path
• Extract aggregate fields
  – Pull in all fields of the last table in the path
  – Aggregate them (using the new aggregates) per t
  – Append the aggregates to t

Page 41: Learning

• Classifier (sketched below)
  – Pick 10 subsets, each w/ 10 features
    • Random choice, weighted by “performance”
    • But there’s no classifier yet... so how do features predict class?
  – Build a decision tree for each feature set
    • Have class frequencies at the leaves
      – Features might not completely distinguish classes
    • Class prediction:
      – Select the class with higher frequency
    • Class probability estimation:
      – Average frequencies over trees
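A sketch of this classifier stage using scikit-learn, assuming a flat feature matrix has already been built from the aggregates. The feature subsets are drawn uniformly at random rather than “weighted by performance”, since that weighting is not specified here; sizes and names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical flat training data built from the aggregate features.
X = rng.normal(size=(200, 30))                   # 200 cases, 30 aggregate features
y = (X[:, 0] + X[:, 7] > 0).astype(int)          # synthetic binary class

n_subsets, subset_size = 10, 10
models = []
for _ in range(n_subsets):
    cols = rng.choice(X.shape[1], size=subset_size, replace=False)  # uniform, not performance-weighted
    tree = DecisionTreeClassifier(max_depth=3).fit(X[:, cols], y)   # shallow tree keeps class freqs at leaves
    models.append((cols, tree))

def predict_proba(x_row):
    # Class probability estimate: average the leaf class frequencies over the trees.
    probs = [tree.predict_proba(x_row[cols].reshape(1, -1))[0] for cols, tree in models]
    return np.mean(probs, axis=0)

print(predict_proba(X[0]))   # e.g. [P(class 0), P(class 1)]
```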

Page 42: Tests

• IPO data
  – 5 tables
    • Most fields in the “main” table, used as T
    • Other tables had a key & one data field
    • Predicate on one field in T used as the class
• Tested against:
  – First-order logic aggregation
    • Extract clauses using an ILP system
    • Append evaluated clauses to each t
  – Various ILP systems
    • Using just data in T
    • (Or T and AH features?)

Page 43: Tests

• IPO data
  – 5 tables w/ a small, simple schema
    • Majority of fields were in the “main” table, i.e. T
      – The only numeric fields were in the main table, so no aggregation of numeric features was needed
    • Other tables had a key & one data field
    • Max path length 2 to reach all tables, no recursion
    • Predicate on one field in T used as the class
• Tested against:
  – First-order logic aggregation
    • Extract clauses using an ILP system
    • Append evaluated clauses to each t
  – Various ILP systems
    • Using just data in T (or T and AH features?)

Page 44: Test results

• See the paper for numbers
• Accuracy with aggregate features:
  – Up to 10% increase over using only features from T
  – Depends on which and how many extra features were used
  – Most predictive feature was in a separate table
  – Expect accuracy to increase as more info is available
  – Shows info was not destroyed by aggregation
  – Vector distance features better
• Generalization

Page 45: Interesting ideas (“I”) & benefits (“B”)

• Taxonomy
  – I: Division into stages of aggregation
    • Slot in any procedure per stage
    • Estimate complexity per stage
  – B: Might get the discussion going
• Aggregate features
  – I: Identifying a “main” table
    • Others get aggregated
  – I: Forming partitions to aggregate over
    • Using queries with joins to pull in other tables
    • Abstract partitioning based on a predicate
  – I: Comparing case against reference histograms
  – I: Separate comparison method and reference

Page 46: Interesting ideas (“I”) & benefits (“B”)

• Learning
  – I: Decision tree tricks
    • Cut DT induction off short to get class freqs
    • Starve the DT of features to improve generalization

Page 47: Issues

• Some worrying lapses...
  – Lacked standard terms for common concepts
    • “position i [of vector has] the number of instances of [ith value]”... -> histogram
    • “abstraction hierarchy” -> schema
    • “value order” -> enumeration
    • Defined (and emphasized) terms for trivial and commonly used things
  – Imprecise use of terms
    • “variance” for (something like) a second moment
    • I’m not confident they know what Mahalanobis distance is
    • They say “left outer join” and show the inner join symbol

Page 48: Issues

• Some worrying lapses...
  – Did not connect “reference vector” and “variance vector” to underlying statistics
    • Should relate to the bag prior and field-value conditional probability, not just “weighted”
  – Did not acknowledge the loss of correlation info from splitting up joined u tuples in their features
    • Assumes fields are independent
    • Dependency was mentioned in the taxonomy
  – Fig 1 schema cannot support the § 2 example query
    • Missing a necessary foreign key reference

Page 49: Issues

• Some worrying lapses...
  – Their formal statement of the task did not show aggregation as dependent on t
    • Needed for c(t) partitioning
  – Did not clearly distinguish when t did or did not contain the class
    • No need to put it in there at all
  – No, the higher Gaussian moments are not all zero!
    • Only the odd ones are. Yeesh.
    • The correct reason we don’t need them is: all can be computed from the mean and variance
  – Uuugly notation

Page 50: Issues

• Some worrying lapses...
  – Did not cite other uses of histograms or distributions extracted as features
    • “Spike-triggered average” / covariance / etc.
      – Used by: all of neurobiology, neurocomputation
      – E.g.: de Ruyter van Steveninck & Bialek
    • “Response-conditional ensemble”
      – Used by: our own Adrienne Fairhall & colleagues
      – E.g.: Agüera y Arcas, Fairhall & Bialek
    • “Event-triggered distribution”
      – Used by: me ☺
      – E.g.: CSE528 project

Page 51: Issues

• Some worrying lapses...
  – Did not cite other uses of histograms or distributions extracted as features...
  – So, did not use “standard” tricks
    • Dimension reduction:
      – Treat the histogram as a vector
      – Do PCA, keep the top few eigenmodes; the new features are the projections (sketched below)
  – Nor “special” tricks:
    • Subtract the prior covariance before PCA
  – Likewise, competing the classes is not new
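A sketch of the suggested “standard trick”: treat each case histogram as a vector, run PCA across cases, and keep the projections onto the top eigenmodes as features. The data and component count are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Case histograms: one row per target tuple t, one column per field value f_i (hypothetical).
histograms = np.array([
    [2, 1, 0, 3],
    [1, 0, 1, 4],
    [0, 3, 2, 0],
    [2, 2, 1, 1],
], dtype=float)

# Keep the top few eigenmodes; the projections become the new (lower-dimensional) features.
pca = PCA(n_components=2)
features = pca.fit_transform(histograms)    # shape: (n_cases, 2)
print(features)
```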

Page 52: Issues

• Non-goof issues
  – Would need bookkeeping to maintain the variance vector for online learning
    • Don’t have sufficient statistics
    • Histograms are the actual “samples”
    • Adding new data doesn’t add new “samples”: it changes existing ones
    • Could subtract the old contribution, add the new one
    • Use a triggered query
  – Don’t bin those nice numerical variables!
    • Binning makes vectors out of scalars
    • Scalar fields can be ganged into a vector across fields!
    • Do (e.g.) clustering on the bag of vectors
• That’s enough of that