Concept Description and Data Generalization (based on the slides for the book Data Mining: Concepts and Techniques)


Page 1

Concept Description and Data Generalization

(based on the slides for the book Data Mining: Concepts and Techniques)

Page 2

2003/04, Sistemas de Apoio à Decisão (Decision Support Systems), LEIC Tagus

Two categories of data mining

Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, discriminative forms.

Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data.

Page 3

What is Concept Description?

Concept description (or class description): generates descriptions for characterization and comparison of data.

Characterization: provides a concise and succinct summarization of the given collection of data.

Class comparison (or discrimination): provides descriptions comparing two or more collections of data.

Page 4

Data Generalization

A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.

[Figure: conceptual levels 1 to 5]

Approaches:
• Data cube approach (OLAP approach)
• Attribute-oriented induction approach

Page 5

Concept Description vs OLAP

Similarities:
• Data generalization
• Presentation of data summarization at multiple levels of abstraction
• Interactive drilling, pivoting, slicing and dicing

Differences (where concept description goes beyond typical OLAP):
• Complex data types of the attributes and their aggregations
• Automated process to find relevant attributes and generalization degree
• Dimension relevance analysis and ranking when there are many relevant dimensions

Page 6

Attribute-Oriented Induction

• Proposed in 1989 (KDD '89 workshop)
• Not confined to categorical data nor to particular measures
• How is it done?
  1. Collect the task-relevant data (initial relation) using a relational database query.
  2. Perform data generalization by attribute removal or attribute generalization, based on the number of distinct values of each attribute.
  3. Apply aggregation by merging identical generalized tuples and accumulating their respective counts.
  4. Present the results interactively to users.

Page 7

Basic Principles (1)

Data focusing: select the task-relevant data, including dimensions; the result is the initial (working) relation.

Attribute removal: remove attribute A if there is a large set of distinct values for A but
(1) there is no generalization operator on A, or
(2) A's higher-level concepts are expressed in terms of other attributes.

Attribute generalization: if there is a large set of distinct values for A and there exists a set of generalization operators on A, then select an operator and generalize A.

Page 8

Basic Principles (2)

Two methods to control a generalization process:

Attribute-threshold control: typically 2-8, user-specified or default. If the number of distinct values in an attribute is greater than the attribute threshold, then removal or generalization applies.

Generalized relation threshold control: sets a threshold for the size of the generalized (final) relation/rule. If the number of distinct tuples in the generalized relation is greater than the threshold, then further generalization applies.
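The removal/generalization rules and the two threshold controls can be sketched as a small decision function. This is only an illustrative reading of the slides: the function names, the parameter names, and the example distinct-value counts are assumptions, not part of the original material.

```python
def plan_attribute(distinct_values, attribute_threshold, has_hierarchy,
                   covered_by_other_attribute=False):
    """Decide what to do with one attribute (hypothetical helper).

    - few distinct values: keep the attribute as it is;
    - many distinct values and a usable generalization operator: generalize;
    - many distinct values but no operator, or the higher-level concepts
      are already expressed by another attribute: remove.
    """
    if distinct_values <= attribute_threshold:
        return "retain"
    if has_hierarchy and not covered_by_other_attribute:
        return "generalize"
    return "remove"


def needs_further_generalization(generalized_tuples, relation_threshold):
    """Generalized relation threshold control: too many distinct tuples?"""
    return len(set(generalized_tuples)) > relation_threshold


# Assumed distinct-value counts, for illustration only.
plan = {
    "name": plan_attribute(1000, 5, has_hierarchy=False),
    "gender": plan_attribute(2, 5, has_hierarchy=False),
    "major": plan_attribute(40, 5, has_hierarchy=True),
}
print(plan)
```

With an attribute threshold of 5, `name` is removed (no hierarchy), `gender` is retained, and `major` is generalized.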

Page 9

Basic Principles (3)

Accumulate count or other aggregate values: to provide statistical information about the data at different levels of abstraction.

Example: the count value for a tuple in the initial relation is 1. When generalization maps n tuples of the initial relation to the same generalized tuple, they are merged into a single generalized tuple whose count is n.

Page 10

Basic Algorithm

1. InitialRel: query processing of task-relevant data, deriving the initial relation.

2. PreGen: based on the analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: remove it, or how high to generalize it?

3. PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts.

4. Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
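Steps 1 to 3 can be sketched end to end on a toy relation. The rows, the one-level concept hierarchies, and the threshold below are made up for illustration (they only loosely echo the student example); a real system would climb multi-level hierarchies until the thresholds are met.

```python
from collections import Counter

# Step 1 (InitialRel): a hypothetical initial relation.
initial_rel = [
    {"name": "Jim", "major": "CS", "birth_place": "Vancouver, BC, Canada"},
    {"name": "Scott", "major": "CS", "birth_place": "Montreal, Que, Canada"},
    {"name": "Laura", "major": "Physics", "birth_place": "Seattle, WA, USA"},
]

# Assumed one-level generalization operators, one per attribute that has one.
hierarchies = {
    "major": lambda v: {"CS": "Science", "Physics": "Science"}.get(v, "Other"),
    "birth_place": lambda v: "Canada" if v.endswith("Canada") else "Foreign",
}


def pre_gen(rows, threshold):
    """Step 2 (PreGen): choose retain / generalize / remove per attribute."""
    plan = {}
    for attr in rows[0]:
        distinct = len({row[attr] for row in rows})
        if distinct <= threshold:
            plan[attr] = "retain"
        elif attr in hierarchies:
            plan[attr] = "generalize"
        else:
            plan[attr] = "remove"
    return plan


def prime_gen(rows, plan):
    """Step 3 (PrimeGen): generalize, merge identical tuples, count them."""
    kept = [attr for attr in rows[0] if plan[attr] != "remove"]
    counts = Counter()
    for row in rows:
        generalized = tuple(
            hierarchies[a](row[a]) if plan[a] == "generalize" else row[a]
            for a in kept
        )
        counts[generalized] += 1  # each initial tuple contributes count 1
    return kept, counts


plan = pre_gen(initial_rel, threshold=1)
attrs, prime_rel = prime_gen(initial_rel, plan)
print(attrs, dict(prime_rel))
```

Here `name` is removed, `major` and `birth_place` are generalized, and the three initial tuples collapse into two generalized tuples with counts 2 and 1.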

Page 11

Class Characterization: Example (1)

Describe general characteristics of graduate students in the Big-University database (in DMQL):

    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in "graduate"

Corresponding SQL statement:

    select name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in ('Msc', 'MBA', 'PhD')

Page 12

Class Characterization: An Example (2)

Initial relation (the last row gives the generalization plan for each attribute):

Name            Gender    Major          Birth_place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M         CS             Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M         CS             Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F         Physics        Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …         …              …                      …           …                         …         …
Removed         Retained  Sci, Eng, Bus  Country                Age range   City                      Removed   Excl, VG, ..

Page 13

Class Characterization: An Example (3)

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Cross-tab of Gender by Birth_region:

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
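As a quick check of the cross-tab, the row, column, and grand totals can be recomputed from its four cell counts. This is a minimal sketch; the dictionary layout and variable names are assumptions made for the example.

```python
from collections import defaultdict

# (gender, birth_region) -> count, taken from the cross-tab on this slide.
cells = {
    ("M", "Canada"): 16, ("M", "Foreign"): 14,
    ("F", "Canada"): 10, ("F", "Foreign"): 22,
}

row_totals = defaultdict(int)  # totals per gender
col_totals = defaultdict(int)  # totals per birth region
for (gender, region), count in cells.items():
    row_totals[gender] += count
    col_totals[region] += count
grand_total = sum(cells.values())

print(dict(row_totals), dict(col_totals), grand_total)
```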

Page 14

Presentation of Generalized Results (1)

Generalized relation: relations where some or all attributes are generalized, with counts or other aggregation values accumulated.

Cross tabulation: mapping results into cross-tabulation form (similar to contingency tables).

Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms.

Page 15

Presentation: Generalized Relation

Page 16

Presentation: Crosstab

Page 17

Presentation of Generalized Results (2)

A generalized relation may also be represented in the form of logic rules.

C_j = target class
q_a = a generalized tuple describing the target class

t-weight for q_a: fraction of the tuples of the target class in the initial working relation that are covered by q_a; range: [0, 1]. With q_1, …, q_n the generalized tuples of the target class:

$$ t\_weight = \frac{count(q_a)}{\sum_{i=1}^{n} count(q_i)} $$

Page 18

Presentation of Generalized Results (3)

Quantitative characteristic rules: mapping the generalized result into characteristic rules with quantitative information associated with them.

The disjunction of the conditions forms a necessary condition of the target class, i.e., all tuples of the target class must satisfy the condition.

It is not a sufficient condition of the target class, since a tuple satisfying the same condition could belong to another class.

$$ \forall x,\; grad(x) \wedge male(x) \Rightarrow birth\_region(x) = \text{"Canada"}\;[t: 53\%] \vee birth\_region(x) = \text{"foreign"}\;[t: 47\%] $$
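The 53% and 47% in this rule are the t-weights of the two generalized tuples for male graduate students, and they follow from the cross-tab counts (16 born in Canada, 14 foreign). A minimal sketch, with an assumed helper name `t_weights`:

```python
def t_weights(counts):
    """t-weight of each generalized tuple: its count over the class total."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


# Counts for male graduate students, from the cross-tab of the example.
male_grads = {"Canada": 16, "Foreign": 14}
weights = t_weights(male_grads)

for region, weight in weights.items():
    print(f'birth_region(x) = "{region}" [t: {weight:.0%}]')
```

The weights of all generalized tuples of one class always sum to 1, which is why the disjunction is a necessary condition for the class.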

Page 19

Attribute Relevance Analysis (1)

Why?
• Which dimensions should be included?
• How high a level of generalization?
• Automatic vs. interactive
• Reduce the number of attributes; easier-to-understand patterns

What?
• Statistical method for preprocessing data
  - filter out irrelevant or weakly relevant attributes
  - retain or rank the relevant attributes
• Relevance related to dimensions and levels
• Analytical characterization, analytical comparison

Page 20

Attribute Relevance Analysis (2)

How?
1. Data collection
2. Preliminary relevance analysis using conservative AOI
3. Analytical generalization
   • Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
   • Sort and select the most relevant dimensions and levels.
4. Attribute-oriented induction for class description
   • Using a less conservative threshold for AOI

Page 21

Relevance Measures

Quantitative relevance measure: determines the classifying power of an attribute within a set of data.

Methods:
• information gain (ID3)
• gain ratio (C4.5)
• Gini index
• χ² contingency table statistics
• uncertainty coefficient

Page 22

Entropy and Information Gain

S contains s_i tuples of class C_i for i = 1, …, m, with s = s_1 + … + s_m.

Entropy, or expected information, measures the information required to classify an arbitrary tuple:

$$ I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s} $$

Entropy of attribute A with values {a_1, a_2, …, a_v}, where s_{ij} is the number of tuples of class C_i with A = a_j:

$$ E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj}) $$

Information gained by branching on attribute A:

$$ Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A) $$
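The three quantities translate almost directly into code. The sketch below is one possible rendering; the function names and the convention of passing one class-count vector per attribute value are assumptions chosen for the example.

```python
from math import log2


def expected_info(class_counts):
    """I(s_1, ..., s_m): entropy of a vector of class counts."""
    total = sum(class_counts)
    # Classes with zero tuples contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def attribute_entropy(partition):
    """E(A): partition[j] holds the class counts (s_1j, ..., s_mj) for a_j."""
    total = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / total * expected_info(counts)
               for counts in partition)


def gain(class_counts, partition):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return expected_info(class_counts) - attribute_entropy(partition)


# Two balanced classes of 5 tuples each.
print(expected_info([5, 5]))             # maximal uncertainty for two classes
print(gain([5, 5], [(5, 0), (0, 5)]))    # attribute separates the classes
print(gain([5, 5], [(2, 2), (3, 3)]))    # attribute carries no information
```

An attribute that splits the classes perfectly recovers the whole entropy as gain, while one whose values keep the same class mix yields a gain of zero.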

Page 23

Example of Analytical Characterization (1)

Task
• Mine general characteristics describing graduate students using analytical characterization

Given
• attributes name, gender, major, birth_place, birth_date, phone#, and gpa
• Gen(a_i) = concept hierarchies on a_i
• U_i = attribute analytical thresholds for a_i
• T_i = attribute generalization thresholds for a_i
• R = attribute relevance threshold

Page 24

Example of Analytical Characterization (2)

1. Data collection
   • target class: graduate student
   • contrasting class: undergraduate student

2. Analytical generalization using U_i
   • attribute removal: remove name and phone#
   • attribute generalization: generalize major, birth_place, birth_date and gpa; accumulate counts
   • candidate relation: gender, major, birth_country, age_range and gpa

Page 25

Example of Analytical Characterization (3)

Candidate relation for the target class: graduate students (total count = 120)

gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18

Candidate relation for the contrasting class: undergraduate students (total count = 130)

gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24

Page 26

Example of Analytical Characterization (4)

3. Relevance analysis

Calculate the expected information required to classify an arbitrary tuple:

$$ I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988 $$

Calculate the entropy of each attribute, e.g. major. Here s_{1j} is the number of graduate students and s_{2j} the number of undergraduate students with the j-th value of major (for example, s_11 is the number of grad students in "Science" and s_21 the number of undergrad students in "Science"):

For major = "Science":      s_11 = 84,  s_21 = 42,  I(s_11, s_21) = 0.9183
For major = "Engineering":  s_12 = 36,  s_22 = 46,  I(s_12, s_22) = 0.9892
For major = "Business":     s_13 = 0,   s_23 = 42,  I(s_13, s_23) = 0

Page 27

Example of Analytical Characterization (5)

Calculate the expected information required to classify a given sample if S is partitioned according to the attribute:

$$ E(major) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873 $$

Calculate the information gain for each attribute:

$$ Gain(major) = I(s_1, s_2) - E(major) = 0.2115 $$

Information gain for all attributes:

Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971
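The gains listed here can be recomputed from the two candidate relations of the example. The sketch below is an assumed encoding (tuples with the count in the last position, helper names `info` and `gain`). Its results agree with the slide values up to rounding in the fourth decimal, since the slides subtract already-rounded intermediate values.

```python
from collections import defaultdict
from math import log2

ATTRS = ("gender", "major", "birth_country", "age_range", "gpa")

# Candidate relations from the example; the last field is the count.
grad = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]
undergrad = [
    ("M", "Science", "Foreign", "<20", "Very_good", 18),
    ("F", "Business", "Canada", "<20", "Fair", 20),
    ("M", "Business", "Canada", "<20", "Fair", 22),
    ("F", "Science", "Canada", "20-25", "Fair", 24),
    ("M", "Engineering", "Foreign", "20-25", "Very_good", 22),
    ("F", "Engineering", "Canada", "<20", "Excellent", 24),
]


def info(counts):
    """I(s_1, ..., s_m) for a vector of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gain(attr):
    """Gain(attr) = I(s_1, s_2) - E(attr) over the two classes."""
    k = ATTRS.index(attr)
    per_value = defaultdict(lambda: [0, 0])  # value -> [grad, undergrad]
    for cls, rows in enumerate((grad, undergrad)):
        for row in rows:
            per_value[row[k]][cls] += row[-1]
    class_totals = [sum(r[-1] for r in grad), sum(r[-1] for r in undergrad)]
    s = sum(class_totals)
    e = sum(sum(c) / s * info(c) for c in per_value.values())
    return info(class_totals) - e


gains = {a: gain(a) for a in ATTRS}
for attr, g in sorted(gains.items(), key=lambda item: item[1]):
    print(f"Gain({attr}) = {g:.4f}")
```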

Page 28

Example of Analytical Characterization (6)

4. Initial working relation (W_0) derivation
   • R = 0.1
   • remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
   • remove the contrasting class candidate relation

5. Perform attribute-oriented induction on W_0 using T_i

Initial target class working relation W_0: graduate students

major        age_range  gpa        count
Science      20-25      Very_good  16
Science      25-30      Excellent  47
Science      20-25      Excellent  21
Engineering  20-25      Excellent  18
Engineering  25-30      Excellent  18
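Step 4 can be reproduced by filtering on the relevance threshold and re-merging the target-class tuples. The sketch below uses the gain values quoted in the example. Treating R as an inclusive lower bound is an assumption, and the data layout is chosen only for illustration.

```python
from collections import Counter

ATTRS = ("gender", "major", "birth_country", "age_range", "gpa")

# Information gains as quoted in the example.
gains = {"gender": 0.0003, "birth_country": 0.0407, "major": 0.2115,
         "gpa": 0.4490, "age_range": 0.5971}
R = 0.1  # attribute relevance threshold

# Keep attributes whose gain reaches R (inclusive bound is an assumption).
relevant = [a for a in ATTRS if gains[a] >= R]

# Target-class candidate relation; the last field is the count.
grad = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]

# Project onto the relevant attributes and merge identical tuples.
w0 = Counter()
for row in grad:
    key = tuple(row[ATTRS.index(a)] for a in relevant)
    w0[key] += row[-1]

print(relevant)
for key, count in w0.items():
    print(*key, count)
```

Dropping gender and birth_country makes the two ("Science", "25-30", "Excellent") tuples identical, so their counts 22 and 25 merge into 47.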

Page 29

Mining Class Comparisons

Comparison: comparing two or more classes.

Method:
• Partition the set of relevant data into the target class and the contrasting class(es)
• Generalize both classes to the same high-level concepts
• Compare tuples with the same high-level descriptions
• Present for every tuple its description and two measures:
  - support: distribution within a single class
  - comparison: distribution between classes
• Highlight the tuples with strong discriminant features

Relevance analysis:
• Find attributes (features) which best distinguish different classes

Page 30

Quantitative Discriminant Rules

C_j = target class
q_a = a generalized tuple that covers some tuples of the target class, but can also cover some tuples of the contrasting class(es)

d-weight (range: [0, 1]), where C_1, …, C_m are the classes being compared:

$$ d\_weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)} $$

Quantitative discriminant rule form:

$$ \forall X,\; target\_class(X) \Leftarrow condition(X)\;[d: d\_weight] $$

Page 31

Example (1)

Compare the general properties of the graduate students and the undergraduate students in the Big-University database, given the attributes name, gender, etc. (in DMQL):

    use Big_University_DB
    mine comparison as "Grad-vs-Undergrad"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    for "graduate_students"
    where status in "graduate"
    versus "undergraduate_students"
    where status in "undergraduate"
    analyze count%
    from student

Page 32

Example (2)

Count distribution between graduate and undergraduate students for a generalized tuple:

Status         Birth_country  Age_range  Gpa   Count
Graduate       Canada         25-30      Good  90
Undergraduate  Canada         25-30      Good  210

Quantitative discriminant rule:

$$ \forall X,\; graduate\_student(X) \Leftarrow birth\_country(X) = \text{"Canada"} \wedge age\_range(X) = \text{"25-30"} \wedge gpa(X) = \text{"good"}\;[d: 30\%] $$

where 90/(90+210) = 30%
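The 30% is the d-weight of this generalized tuple for the graduate class. A minimal sketch of the computation, with an assumed helper name `d_weight`:

```python
def d_weight(counts_by_class, target):
    """Share of the tuples covered by one generalized tuple that fall
    in the target class, relative to all compared classes."""
    return counts_by_class[target] / sum(counts_by_class.values())


# Counts for (Canada, 25-30, Good), from the table of this example.
counts = {"Graduate": 90, "Undergraduate": 210}

d_grad = d_weight(counts, "Graduate")
print(f"[d: {d_grad:.0%}]")
```

A low d-weight such as this one means the condition alone is weak evidence for the target class: most students matching it are undergraduates.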

Page 33

Class Description

Quantitative characteristic rule (necessary):

$$ \forall X,\; target\_class(X) \Rightarrow condition(X)\;[t: t\_weight] $$

Quantitative discriminant rule (sufficient):

$$ \forall X,\; target\_class(X) \Leftarrow condition(X)\;[d: d\_weight] $$

Quantitative description rule (necessary and sufficient):

$$ \forall X,\; target\_class(X) \Leftrightarrow condition_1(X)\;[t: w_1, d: w'_1] \vee \cdots \vee condition_n(X)\;[t: w_n, d: w'_n] $$

Page 34

Example: Quantitative Description Rule

Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998:

Location/item  TV                     Computer               Both_items
               Count  t-wt    d-wt    Count  t-wt    d-wt    Count  t-wt  d-wt
Europe         80     25%     40%     240    75%     30%     320    100%  32%
N_Am           120    17.65%  60%     560    82.35%  70%     680    100%  68%
Both_regions   200    20%     100%    800    80%     100%    1000   100%  100%

Quantitative description rule for target class Europe:

$$ \forall X,\; Europe(X) \Leftrightarrow (item(X) = \text{"TV"})\;[t: 25\%, d: 40\%] \vee (item(X) = \text{"computer"})\;[t: 75\%, d: 30\%] $$
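Both weights in the rule can be derived from the raw counts of the crosstab: the t-weight normalizes within the target region, the d-weight normalizes within the item across regions. The nested-dictionary layout and the function name below are assumptions made for the sketch.

```python
# Units sold (in thousands), from the crosstab of this example.
sales = {
    "Europe": {"TV": 80, "Computer": 240},
    "N_Am": {"TV": 120, "Computer": 560},
}


def description_weights(sales, target):
    """Return {item: (t_weight, d_weight)} for the target class."""
    target_total = sum(sales[target].values())
    weights = {}
    for item, count in sales[target].items():
        item_total = sum(region[item] for region in sales.values())
        weights[item] = (count / target_total, count / item_total)
    return weights


europe = description_weights(sales, "Europe")
for item, (t, d) in europe.items():
    print(f'item(X) = "{item}" [t: {t:.0%}, d: {d:.0%}]')
```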

Page 35

Bibliography

(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 5 in the 2001 book, Section 3.7 in the draft)

(Book) Machine Learning, T. Mitchell, McGraw-Hill, 1997 (Section 3.4)

Page 36

Information-Theoretic Approach

Decision tree
• each internal node tests an attribute
• each branch corresponds to an attribute value
• each leaf node assigns a classification

ID3 algorithm
• builds a decision tree from training objects with known class labels, in order to classify testing objects
• ranks attributes with the information gain measure
• aims at minimal height: the least number of tests to classify an object

Page 37

Top-Down Induction of Decision Tree

Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}

Outlook
  sunny    -> Humidity
                high   -> no
                normal -> yes
  overcast -> yes
  rain     -> Wind
                strong -> no
                weak   -> yes
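To illustrate how such a tree classifies an object, the PlayTennis tree can be written as nested data and walked from the root. The encoding (a tuple of attribute name and branch dictionary, with strings as leaves) is an assumption for this sketch, and the code only applies a finished tree; it does not perform the ID3 induction itself.

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string.
TREE = ("Outlook", {
    "sunny": ("Humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain": ("Wind", {"strong": "no", "weak": "yes"}),
})


def classify(node, example):
    """Follow the branch matching each tested attribute down to a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node


day = {"Outlook": "sunny", "Temperature": "hot",
       "Humidity": "high", "Wind": "weak"}
print(classify(TREE, day))
```

Note that Temperature never appears in the tree: an attribute that is not tested on any path plays no part in the classification, and no path needs more than two tests.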