An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data


Page 1: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Intelligent Database Systems Lab

National Yunlin University of Science and Technology

An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Advisor : Dr. Hsu

Presenter : Jing-Wei Lin

Page 2: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

2

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

Outline

Motivation
Objective
Introduction
SOM and AOI
GSOM and EAOI
Exploratory clustering and pattern extraction
Experimental results
Conclusions

Page 3: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Motivation

A successful integration relies on appropriate individual techniques. However, the traditional self-organizing map (SOM) and attribute-oriented induction (AOI) each have drawbacks.

The traditional SOM cannot directly handle categorical data.

AOI may fail to preserve the major values of an attribute, leading to overgeneralization. For example, if the salary data contain 30 records for Taipei and one record each for Taoyuan and Hsinchu, the value that AOI produces to represent income in northern Taiwan may be misleading.

Page 4: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

SOM

The SOM is an unsupervised neural network which projects high-dimensional data onto a low-dimensional grid, usually two-dimensional, and preserves the topological relationships of the original data.

Page 5: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

AOI

Attribute-oriented induction extracts data patterns from a large amount of data and produces a set of concise rules, which represent the general patterns hidden in the data.

Note: AOI is a technique for extracting data characteristics from relational databases.

Page 6: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Objective

Propose a generalized self-organizing map (GSOM) and an extended attribute-oriented induction (EAOI), which not only overcome the drawbacks of their original algorithms but also provide additional analysis capabilities.

Page 7: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Introduction

Among unsupervised clustering techniques, a lot of attention has been paid to the self-organizing map (SOM), which projects high-dimensional data onto low-dimensional grids without losing their topological order.

Regarding pattern extraction techniques, attribute-oriented induction (AOI) is a popular and effective approach.

Page 8: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Introduction (cont.)

The integrated analysis framework works as follows: train the GSOM using preprocessed data, perform data clustering visually and exploratorily on the trained map, and then extract the characteristics of individual clusters using the EAOI.

Page 9: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Introduction (cont.)

The GSOM is able to directly handle categorical data: because binary transformation can lose information or leave it incomplete, each link of the concept hierarchy tree is given a weight so that the exact distance between categorical values can be computed.

Page 10: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Introduction (cont.)

The EAOI offers the additional capability of preserving major values in the data: on top of the traditional AOI procedure, it considers occurrence counts to capture how attribute values are distributed, and it introduces a "major value" indicator for categorical data to address overgeneralization.

Page 11: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

SOM

Training on the SOM essentially involves two steps:

Identifying: each training pattern is compared with all the units of the map to identify the best matching unit (BMU), the unit most similar to the training pattern.

Adjusting: the BMU and its neighbors are updated to resemble the training pattern.
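The two steps can be sketched in a few lines of Python. This is a minimal illustration for numeric data only, not the presenters' implementation; the function name `som_step`, the dictionary layout of the map, and the Gaussian neighborhood are assumptions made for the example.

```python
import math

def som_step(units, x, alpha, radius):
    """One training step. `units` maps a grid position (row, col) to a weight vector."""
    # Identifying: the BMU is the unit whose weight vector is closest to x.
    bmu = min(units, key=lambda pos: math.dist(units[pos], x))
    # Adjusting: pull the BMU and its grid neighbors towards x.
    for pos, w in units.items():
        grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        h = math.exp(-grid_d2 / (2 * radius ** 2))  # assumed Gaussian neighborhood
        units[pos] = [wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu

# Tiny 1x2 map with two-dimensional weight vectors (hypothetical values).
units = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
bmu = som_step(units, [0.9, 0.9], alpha=0.5, radius=1.0)
```

After the call, the BMU at `(0, 1)` has moved halfway towards the pattern, and its neighbor has moved by a smaller amount scaled by the neighborhood function.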

Page 12: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Problem with the SOM

The conventional SOM cannot directly handle categorical attributes.

The binary transformation approach has at least four disadvantages:
(1) Similarity information among categorical values is not conveyed.
(2) When the domain of a categorical attribute is large, the transformation increases the dimensionality of the transformed relation.
(3) Maintenance is difficult.
(4) The names of binary attributes fail to preserve the semantics of the original categorical attribute.
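Disadvantages (1) and (2) are easy to see in a small sketch. The values below are hypothetical; the point is only that a binary (one-hot) encoding needs one dimension per value and places every pair of distinct values at the same distance, so "Coke" is no closer to "Pepsi" than to "TV".

```python
import math
from itertools import combinations

VALUES = ["Coke", "Pepsi", "TV"]  # hypothetical domain of one categorical attribute

def one_hot(value):
    """Binary transformation: one 0/1 dimension per value in the domain."""
    return [1.0 if value == v else 0.0 for v in VALUES]

# Euclidean distance between every pair of encoded values.
pairwise = {
    (a, b): math.dist(one_hot(a), one_hot(b))
    for a, b in combinations(VALUES, 2)
}
```

Every entry of `pairwise` equals the square root of 2, regardless of how semantically close the two values are.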

Page 13: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

AOI

The induction method mainly includes two steps, attribute removal and attribute generalization.

Attribute removal: attributes with too many distinct values, or whose meaning duplicates that of another attribute, are removed.

Attribute generalization: for each remaining attribute, the original attribute values, which are more specific, are replaced by values closer to the root of its concept hierarchy, which are more general.

Page 14: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Problems with handling major values and numeric attributes

The traditional AOI is incapable of revealing major values and suffers from discretizing numeric attributes.

Regarding the construction of concept hierarchies for numeric attributes, there are two problems:

(1) Subjectivity of the construction: the criteria for building the concept hierarchy are set subjectively by a person, so similar data may be assigned to different categories.

(2) The generalization of boundary values: for example, when 50 to 100 is defined as the middle level, 49.9 differs only slightly from 50 but is assigned to the low level.

Page 15: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Distance hierarchy

To alleviate the drawbacks resulting from binary transformation, we propose the distance hierarchy: a concept hierarchy extended with weights, which serves as the mechanism to facilitate the representation and measurement of the distance between categorical values.

Page 16: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Distance hierarchy (cont.)

The least common ancestor of two points X and Y, denoted as LCA(X, Y), is the deepest node that is an ancestor of both X and Y; e.g., LCA(X, Z)=Drink.

Page 17: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Distance hierarchy (cont.)

The least common point of two points X and Y, denoted as LCP(X, Y), is defined as one of the three cases:

(1) either X or Y if they are at the same position (i.e., equivalent);
(2) Y if Y is an ancestor of X; otherwise
(3) LCP(X, Y)=LCA(X, Y)

Page 18: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Distance hierarchy (cont.)

The distance between two points in a distance hierarchy is the total weight between them.

Let X=(NX, dX) and Y=(NY, dY) be the two points. The distance between X and Y is defined as

|X - Y| = dX + dY - 2 dLCP(X, Y)

where dLCP(X, Y) is the distance from the root to LCP(X, Y).

Note: the offset d represents the distance from the root of the hierarchy to the point, e.g., dX for X.
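A minimal sketch of this distance follows. The hierarchy, its unit link weights, and all function names are hypothetical; the sketch also assumes that the total weight between two points equals dX + dY - 2 dLCP(X, Y), with the offset of the LCP taken as the smallest of dX, dY and the depth of the anchors' least common ancestor, which covers the three LCP cases.

```python
# Hypothetical distance hierarchy: child -> (parent, link weight). "Any" is the root.
LINKS = {
    "Drink": ("Any", 1.0), "Appliance": ("Any", 1.0),
    "Coke": ("Drink", 1.0), "Pepsi": ("Drink", 1.0), "TV": ("Appliance", 1.0),
}

def path_from_root(node):
    """Nodes from the root down to `node`."""
    path = [node]
    while path[-1] in LINKS:
        path.append(LINKS[path[-1]][0])
    return path[::-1]

def depth(node):
    """Total link weight from the root to `node`."""
    total = 0.0
    while node in LINKS:
        node, weight = LINKS[node]
        total += weight
    return total

def lca(p, q):
    """Deepest node that is an ancestor of (or equal to) both p and q."""
    common = None
    for a, b in zip(path_from_root(p), path_from_root(q)):
        if a != b:
            break
        common = a
    return common

def distance(x, y):
    """x and y are points (anchor, offset from the root)."""
    (p, dx), (q, dy) = x, y
    d_lcp = min(dx, dy, depth(lca(p, q)))  # offset of LCP(X, Y)
    return dx + dy - 2 * d_lcp
```

For instance, `distance(("Coke", 2.0), ("Pepsi", 2.0))` walks up to Drink and back down, giving 2.0, while two points on the same root-to-leaf path differ simply by their offsets.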

Page 19: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Distance between a pattern and a map unit (cont.)

For example, assume that a two-dimensional pattern x=(x1, x2)=(Coke, 9), Dom(x2)=[5, 20], and distance hierarchies dh1 (categorical) and dh2 (numeric) are given as shown in the figure.

x1=Coke is mapped to X=(Coke, 2) in dh1. x2=9 is mapped to X=(MAX, 4) in dh2.

Note: dhi=Xi-Leaf distance

Page 20: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Distance between a pattern and a map unit (cont.)

Assume a unit m consists of n components, m=[m1, m2, …, mn].

Each mi, which can be categorical or numeric, is composed of two parts: (N, d).

That is, mi=(N, d) is mapped to a point M with the value (N, d), denoted as dhi(mi)=M=(N, d), indicating that the anchor of the mapping point M is N and the offset from the root is d.

Page 21: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Distance between a pattern and a map unit (cont.)

Suppose x, m, and dh represent a training pattern, a map unit, and a set of distance hierarchies, respectively. Then the distance between x and m is defined as

d(x, m) = ( Σi |dhi(xi) - dhi(mi)|^2 )^(1/2)

For example, the differences between the paired mapping points of x and m are |(Coke, 2)-(Coke, 0.3)|=1.7 and |(MAX, 4)-(MAX, 6)|=2, respectively, making the distance between x and m (1.7^2+2^2)^(1/2)≈2.62.

(Note: this allows categorical data to be handled without binary transformation.)
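The arithmetic of this example can be checked with a short sketch. It is deliberately restricted to the situation in the example, where each pair of mapping points shares an anchor, so the difference within a hierarchy is just the difference of offsets; the function names are hypothetical, and the numeric mapping assumes the offset is the value minus the lower bound of the domain, as the mapping of 9 to (MAX, 4) with Dom(x2)=[5, 20] suggests.

```python
import math

def map_numeric(value, lower_bound):
    """Map a numeric value to a point (MAX, offset) in a degenerated hierarchy."""
    return ("MAX", value - lower_bound)

def same_anchor_difference(x_point, m_point):
    """Difference of two points with the same anchor (same root-to-leaf path)."""
    if x_point[0] != m_point[0]:
        raise ValueError("this sketch only handles points that share an anchor")
    return abs(x_point[1] - m_point[1])

def pattern_unit_distance(x_points, m_points):
    """Square root of the summed squared differences of the paired mapping points."""
    return math.sqrt(sum(
        same_anchor_difference(x, m) ** 2 for x, m in zip(x_points, m_points)
    ))

x_points = [("Coke", 2.0), map_numeric(9, 5)]   # pattern x = (Coke, 9)
m_points = [("Coke", 0.3), ("MAX", 6.0)]        # map unit m
d = pattern_unit_distance(x_points, m_points)
```

Here `d` rounds to 2.62, matching the value quoted on the slide.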

Page 22: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Adaptation of a unit component

Let X=(P, dX), M=(Q, dM), δ (delta) be the adjusting amount, and NLCA be the least common ancestor of the anchors P and Q.

Case 1: new M is (Q, dM+δ)
Case 2: new M is (P, dM+δ)
Case 3: new M is (Q, dM-δ)
Case 4: new M is (P, 2dNLCA-dM+δ)
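The slide lists only the outcome of each case, so the conditions in the sketch below are an inferred reading rather than the authors' statement: M moves down when it lies above X on a shared path (Cases 1 and 2, depending on whether it passes NLCA), and moves up when it lies on Q's own branch (Cases 3 and 4, again depending on whether it passes NLCA). The function name and the requirement that δ does not exceed the distance between X and M are assumptions.

```python
def adapt(m, x, delta, d_lca):
    """Move map-unit point m=(Q, dM) towards pattern point x=(P, dX) by delta.

    d_lca is the offset (distance from the root) of the least common ancestor
    of the anchors P and Q. Assumes 0 <= delta <= |X - M|.
    """
    (q, dm), (p, dx) = m, x
    if dm <= d_lca:
        # M lies on the part of the path shared by P and Q.
        if dx >= dm:
            new_offset = dm + delta
            if new_offset <= d_lca:
                return (q, new_offset)          # Case 1: stays above NLCA
            return (p, new_offset)              # Case 2: passes NLCA onto P's branch
        return (q, dm - delta)                  # X is above M on the same path
    # M lies on Q's own branch, below NLCA.
    if dx <= d_lca or dm - delta >= d_lca:
        return (q, dm - delta)                  # Case 3: moves up, anchor unchanged
    return (p, 2 * d_lca - dm + delta)          # Case 4: passes NLCA onto P's branch
```

With Coke and Pepsi at offset 2 under a common parent at offset 1, `adapt(("Pepsi", 1.2), ("Coke", 2.0), 0.5, 1.0)` climbs 0.2 to the parent and descends 0.3 towards Coke, ending at offset 1.3 with anchor Coke.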

Page 23: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

GSOM-Adaptation of a unit component

For a numeric component, the adjusting process is simpler due to its degenerated hierarchy. Let X=(MAX, dX), M=(MAX, dM), and δ be the adjusting amount. If dM > dX, the new M is (MAX, dM-δ); otherwise it is (MAX, dM+δ).

Page 24: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

EAOI

For the exploration of major values, we introduce a parameter, the majority threshold β. If some values (i.e., major values) take up a major portion (exceeding β) of an attribute, the EAOI preserves those major values and generalizes the other, non-major values. If β is set to 1, the EAOI degenerates to the AOI. Note: 0<β<=1

Besides the cluster and class feature dimensions, the EAOI also adds the mean and standard deviation for numeric data, to address the biased data characteristics that the traditional AOI can produce.

Page 25: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

EAOI (cont.)

Page 26: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

EAOI (cont.)

Algorithm: An EAOI algorithm for major values and alternative processing of numeric attributes

Input: A relation W with an attribute set A; a set of concept hierarchies; generalization threshold θ and majority threshold β.

Output: A generalized relation P.

Page 27: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

EAOI (cont.)

Method:

1. Determine whether to generalize numeric attributes.

2. For each attribute Ai to be generalized in W,
2.1 Determine whether Ai should be removed, and if not, determine its minimum desired generalization level Li in its concept hierarchy.
2.2 Construct its major-value set Mi according to θ and β.
2.3 For v ∈ Dom(Ai), if v ∉ Mi, construct the mapping pair as (v, vLi-MLi); otherwise, as (v, v).

3. Derive the generalized relation P by replacing each value v by its mapping value and computing other aggregate values.
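Steps 2.2 and 2.3 can be illustrated for a single categorical attribute. The slides do not spell out how Mi is built from θ and β, so the rule used here is an assumption: take the most frequent values, at most θ-1 of them, and keep them as major values only if together they cover at least a fraction β of the tuples. The function names, the one-level hierarchy, and the data (the Taipei example from the Motivation slide) are likewise illustrative.

```python
from collections import Counter

def major_values(values, beta, theta):
    """Assumed rule: up to theta-1 most frequent values covering at least beta."""
    counts = Counter(values)
    chosen, covered = [], 0
    for value, count in counts.most_common(theta - 1):
        chosen.append(value)
        covered += count
        if covered / len(values) >= beta:
            return set(chosen)
    return set()  # no small set of values dominates the attribute

def eaoi_generalize(values, parent, beta, theta):
    """Keep major values; map the others to 'ancestor-{major values under it}'."""
    majors = major_values(values, beta, theta)
    generalized = []
    for v in values:
        if v in majors:
            generalized.append(v)               # mapping pair (v, v)
            continue
        ancestor = parent[v]
        kept = sorted(m for m in majors if parent[m] == ancestor)
        label = ancestor + "-{" + ", ".join(kept) + "}" if kept else ancestor
        generalized.append(label)               # mapping pair (v, vLi - MLi)
    return Counter(generalized)

city = ["Taipei"] * 30 + ["Taoyuan", "Hsinchu"]
parent = {"Taipei": "North_Taiwan", "Taoyuan": "North_Taiwan", "Hsinchu": "North_Taiwan"}
result = eaoi_generalize(city, parent, beta=0.75, theta=3)
```

Under these assumptions Taipei is preserved with 30 tuples and the remaining two are generalized to `North_Taiwan-{Taipei}`; with `beta=1.0` no value qualifies as major and everything collapses to `North_Taiwan`, as in the plain AOI.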

Page 28: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Exploratory clustering and pattern extraction

The GSOM alone is incapable of extracting clusters' characteristics, whereas the EAOI alone will result in overgeneralization if the data are diversified and not clustered before generalization.

Three kinds of patterns can be analyzed: cluster characteristics, discriminant rules, and characteristic rules.

Page 29: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Exploratory clustering and pattern extraction (cont.)

Cluster characteristics: extracted by the EAOI from each cluster Ci, cluster characteristics are expressed as a set of patterns, each paired with its support.

For example, C1: {[(City=Taipei, Salary=(51000, 0)); 0.97], [(City=North_Taiwan-{Taipei}, Salary=(51000, 1000)); 0.03]} represents two patterns extracted from C1, with supports of 97% and 3%.

Page 30: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Exploratory clustering and pattern extraction (cont.)

Discriminant rules:

For instance, If C1: {[(City=Taipei, Salary=(51000, 0)); 0.97], [(City=North_Taiwan-{Taipei}, Salary=(51000, 1000)); 0.03]} → {A(0.7), B(0.3)} indicates that C1 has two patterns taking up 97% and 3%, respectively, and these patterns imply Class A with 70% confidence or Class B with 30% confidence.

Page 31: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

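Exploratory clustering and pattern extraction (cont.)

Characteristic rules:

IF Drink ((birthPlace=Taichung, company=Business Administration, amt=(200, 3.4), (C2, 0.8)) or (birthPlace=Taipei, company=College of Management - Business Administration, amt=(150, 2.1), (C1, 0.2)))

This indicates that the "Drink" class contains two rules. In the first, 80% belong to Cluster 2, characterized by Taichung, Business Administration, and a mean purchase amount and standard deviation of 200 and 3.4. In the second, 20% belong to Cluster 1, characterized by Taipei, College of Management - Business Administration, and a mean purchase amount and standard deviation of 150 and 2.1. The major value is "Business Administration".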

Page 32: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Synthetic data

This experiment compares the results of the conventional SOM and AOI with those of the GSOM and EAOI on a synthetic, mixed dataset.

We designed a dataset of 400 tuples, which has four attributes plus one class attribute, as shown in Table 1.

Page 33: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Synthetic data

Page 34: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Synthetic data

The hierarchies for the attributes are shown in the figure; the hierarchies of Age and Amount are for the traditional AOI.

Page 35: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Synthetic data

The map size is 64 units. The learning rate is a linear function with the initial value, and the neighborhood radius function is set to the side length of the map. The training time T is at least 10 times the map size.
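A linear schedule of this kind might look as follows. The initial learning rate is not preserved in the transcript, so `ALPHA0` is a placeholder; the square 8 x 8 layout and the linear shrinking of the radius from the side length are also assumptions.

```python
MAP_UNITS = 64
MAP_SIDE = 8             # assumes a square 8 x 8 map
T = 10 * MAP_UNITS       # minimum training time stated on the slide
ALPHA0 = 0.9             # hypothetical initial learning rate, not from the source

def linear_decay(initial, t, total):
    """Value that falls linearly from `initial` at t=0 to 0 at t=total."""
    return initial * (1.0 - t / total)

def learning_rate(t):
    return linear_decay(ALPHA0, t, T)

def neighborhood_radius(t):
    # Assumed to start at the side length of the map and shrink linearly.
    return linear_decay(MAP_SIDE, t, T)
```

Halfway through training (`t = T / 2`), both the learning rate and the radius are at half of their initial values.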

Page 36: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Synthetic data

The figure shows the training results of the GSOM and the SOM after a training time of 12,000.

Page 37: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Synthetic data

We further use the EAOI and AOI to extract discriminant rules for the four groups formed on the GSOM. The parameters are set as follows: the attribute generalization threshold θ=3 and the majority threshold β=0.75.

Page 38: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Synthetic data

GSOM

Page 39: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

The dataset has 15 attributes: eight categorical, six numerical, and one class attribute, Salary, which indicates whether the salary is over 50K (>50K) or not (<=50K).

Page 40: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

Page 41: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

We use three criteria to cluster the training results.

Page 42: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

For instance, the second criterion (d <= 2.828) merges Clusters 4 and 7 of the GSOM in Fig. 10(a) and merges Clusters 1, 2, 5, 6, 10, 12, and 13 of the SOM in Fig. 10(b).

Page 43: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

The average categorical utility (ACU) of a set of clusters is calculated from two probabilities: P(Ai=Vij|Ck), the conditional probability that the attribute Ai has the value Vij given the cluster Ck, and P(Ai=Vij), the overall probability of Ai having Vij in the entire dataset.
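The formula itself is not preserved in the transcript. The sketch below therefore assumes the standard category utility, averaged over the K clusters: each cluster contributes its share of the data times the sum, over attributes and values, of P(Ai=Vij|Ck)^2 - P(Ai=Vij)^2. The function name and the toy data are illustrative.

```python
from collections import Counter

def average_category_utility(clusters):
    """clusters: list of clusters, each a list of tuples of categorical values."""
    data = [row for cluster in clusters for row in cluster]
    n_attrs = len(data[0])

    def sum_squared_probs(rows):
        # Sum over attributes i and values j of P(Ai=Vij)^2 within `rows`.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(row[i] for row in rows)
            total += sum((c / len(rows)) ** 2 for c in counts.values())
        return total

    baseline = sum_squared_probs(data)
    weighted = sum(
        len(cluster) / len(data) * (sum_squared_probs(cluster) - baseline)
        for cluster in clusters
    )
    return weighted / len(clusters)

pure = [[("a",), ("a",)], [("b",), ("b",)]]   # each cluster holds a single value
acu = average_category_utility(pure)
```

A clustering that separates the values perfectly scores higher than one that leaves them mixed; a single cluster containing all the data scores 0.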

Page 44: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

We compute the ACU of the categorical values of the clusters formed by the three clustering criteria at the leaf level and at Level 1 of the distance hierarchies, together with the rate of increase, as shown in the table.

Page 45: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

The expected entropy of an attribute C in a set of clusters can be used to measure how the class values are distributed in the clusters. In its formula, Vj denotes one of the possible values that C can take, |Ck| is the size of Cluster k, and |D| is the dataset size.

The chaining effect results in a reduced number of clusters and an increased expected entropy.
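Since the formula is missing from the transcript, the following sketch assumes the usual definition: the entropy of the class values within each cluster, weighted by |Ck|/|D| and summed over clusters. The base-2 logarithm and the function name are assumptions.

```python
import math
from collections import Counter

def expected_entropy(cluster_labels):
    """cluster_labels: one list of class labels per cluster."""
    dataset_size = sum(len(labels) for labels in cluster_labels)
    result = 0.0
    for labels in cluster_labels:
        counts = Counter(labels)
        # Entropy of the class distribution inside this cluster.
        entropy = -sum(
            (c / len(labels)) * math.log2(c / len(labels)) for c in counts.values()
        )
        result += len(labels) / dataset_size * entropy
    return result

separated = expected_entropy([[">50K", ">50K"], ["<=50K", "<=50K"]])
merged = expected_entropy([[">50K", ">50K", "<=50K", "<=50K"]])
```

Merging the two pure clusters into one raises the expected entropy from 0 to 1 bit, which is the effect the slide attributes to chaining.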

Page 46: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

The Salary class distributions in the clusters are shown in Table 6, where Clusters 4 and 1 have the largest ratios of >50K. Clusters 5, 3, and 7 have much lower ratios of >50K compared to the dataset.

We use the EAOI and AOI to extract cluster patterns. The parameters were set as follows: the attribute generalization threshold θ=4 and the majority threshold β=0.75.

Page 47: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

Tables 7 and 8 show a portion of the patterns extracted from Clusters 4, 2, and 7 by both methods.

Page 48: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - UCI adult dataset

Page 49: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Sales data

In another experiment, we used a subset of the sales records of a store at a university from 4/12/1999 to 7/17/2000.

Page 50: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Sales data

Page 51: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Sales data

Page 52: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Experimental results - Sales data

Page 53: An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Conclusions

The attributes participating in the training of the GSOM have a significant impact on the results due to the distance metric used in the training algorithm.

If a class attribute is involved in the data, relevance analysis between the class attribute and the others (or feature selection) should be performed before training to ensure the quality of cluster analysis.