Page 1:

OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage

Ben-Gurion University of the Negev
Faculty of Engineering Sciences
Department of Information Systems Engineering

Ma'ayan Gafny, Asaf Shabtai, Lior Rokach, Yuval Elovici

Page 2:

Definitions

Page 3:

Definitions

$T_A$: a given table A
$T_B$: a given table B (our goal is to link records from table $T_A$ with one or more records from $T_B$)
$|T_A|$: the number of records in $T_A$
$|T_B|$: the number of records in $T_B$
$A$: the set of attributes of table $T_A$, where $a_i$ is the i-th attribute
$|A|$: the number of attributes in $T_A$
$B$: the set of attributes of table $T_B$, where $b_i$ is the i-th attribute
$|B|$: the number of attributes in $T_B$
$r^{(a)} \in T_A$: a record from table $T_A$
$r^{(b)} \in T_B$: a record from table $T_B$
$T_A \times T_B$: the table generated by applying the Cartesian product of $T_A$ and $T_B$
$r = (r^{(a)}, r^{(b)}) \in T_A \times T_B$: a record of $T_A \times T_B$
$T_{AB} \subseteq T_A \times T_B$: the set of matching records
$\overline{T_{AB}} \subseteq T_A \times T_B$: the set of non-matching records
$d$: a node in the OCCT model
$A_d \subseteq A$: the subset of attributes of $T_A$ that were already selected as splitting attributes on the path from the root of the tree to node $d$
$T_{AB}(d) \subseteq T_{AB}$: the subset of matching instances at node $d$ of the OCCT tree
$\mathrm{Split}_a(T_{AB}(d)) = T_{AB}(d)(a)$: the splitting of $T_{AB}(d)$ into $n$ subsets according to attribute $a$, such that $\forall i = 1..n:\ T_{AB}(d_i)(a) = \{ r \in T_{AB}(d) \mid a = v_i \}$
$\sigma_p(T_{AB}(d))$: the selection operator, used to select the records in $T_{AB}(d)$ that satisfy the given predicate $p$ (in this case $p$ is $a = v_i$)
$\pi_A(T_{AB}(d))$: the projection operator, used to select the subset of attributes in $T_{AB}(d)$ that appear in the attribute collection $A$
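The selection, projection and split operators defined above can be illustrated with a small sketch. The records, attribute names and values below are hypothetical, and the helper names `select`, `project` and `split` are not from the slides; they only mirror the roles of the selection, projection and Split operators.

```python
from collections import defaultdict

# Hypothetical toy instance of T_AB(d): each record maps attribute -> value.
records = [
    {"a1": "v1", "a2": "x", "b1": "p"},
    {"a1": "v2", "a2": "y", "b1": "q"},
    {"a1": "v1", "a2": "y", "b1": "q"},
]

def select(rows, predicate):
    """Selection: keep the records that satisfy the predicate p."""
    return [r for r in rows if predicate(r)]

def project(rows, attributes):
    """Projection: keep only the listed attributes of every record."""
    return [{k: r[k] for k in attributes} for r in rows]

def split(rows, attribute):
    """Split: partition the records by the value of one attribute."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[attribute]].append(r)
    return dict(parts)

subsets = split(records, "a1")                          # one subset per value of a1
v1_rows = select(records, lambda r: r["a1"] == "v1")    # predicate a1 = v1
b_side = project(v1_rows, ["b1"])                       # attributes of T_B only
```

Each subset returned by `split` equals the selection with the predicate a = v_i, which is the relationship the Split definition states.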

Page 4:

Definitions

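$T_A$: columns $a_1, a_2, a_3, a_4, \ldots, a_n$
$T_B$: columns $b_1, b_2, b_3, b_4, \ldots, b_m$

$A = \{a_1, a_2, a_3, \ldots, a_n\}$, $|A| = n$
$|T_A|$ = number of records in $T_A$
$r^{(a)}$ = a record from $T_A$

$B = \{b_1, b_2, b_3, \ldots, b_m\}$, $|B| = m$
$|T_B|$ = number of records in $T_B$
$r^{(b)}$ = a record from $T_B$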

Page 5:

Definitions

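$T_A$: columns $a_1, a_2, a_3, a_4, \ldots, a_n$
$T_B$: columns $b_1, b_2, b_3, b_4, \ldots, b_m$
$T_A \times T_B$: columns $a_1, a_2, a_3, a_4, \ldots, a_n, b_1, b_2, b_3, b_4, \ldots, b_m$

$r = (r^{(a)}, r^{(b)})$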
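As a minimal sketch of the Cartesian product shown here, the snippet below pairs every record of one table with every record of the other. The two tables and their values are hypothetical.

```python
from itertools import product

# Hypothetical toy tables; attribute names and values are made up.
T_A = [{"a1": "x"}, {"a1": "y"}]
T_B = [{"b1": 1}, {"b1": 2}, {"b1": 3}]

# Every record r = (r_a, r_b) pairs one record of T_A with one record of T_B.
T_AxB = list(product(T_A, T_B))

print(len(T_AxB))  # 6, that is |T_A| * |T_B|
```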

Page 6:

Definitions

$T_A \times T_B$ with columns $a_1, a_2, \ldots, a_n, b_1, b_2, b_3, b_4, \ldots, b_m$ and a Target column:
• rows labeled match form $T_{AB}$
• rows labeled no-match form $\overline{T_{AB}}$
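The partition of the Cartesian product into matching and non-matching pairs can be sketched as follows. The record identifiers and the set of known matches are hypothetical; the slides do not say how the Target labels are obtained.

```python
from itertools import product

T_A = ["a_rec1", "a_rec2"]            # hypothetical record identifiers
T_B = ["b_rec1", "b_rec2", "b_rec3"]

# Hypothetical ground truth: pairs known to match (one-to-many is allowed).
T_AB = {("a_rec1", "b_rec1"), ("a_rec1", "b_rec2"), ("a_rec2", "b_rec3")}

all_pairs = set(product(T_A, T_B))
T_AB_bar = all_pairs - T_AB           # the non-matching pairs

# The Target column: one label per pair of the Cartesian product.
target = {pair: ("match" if pair in T_AB else "no-match") for pair in all_pairs}
```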


Page 8:

Definitions

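Node $d$ is split on attribute $a$:
• branch $a = v_1$ leads to child $d_1$, whose table (columns $a_1, a_2, \ldots, a_n, b_1, \ldots, b_m$) holds only records with $a = v_1$
• branch $a = v_2$ leads to child $d_2$, whose table (columns $a_1, a_2, \ldots, a_n, b_1, \ldots, b_m$) holds only records with $a = v_2$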

Page 9:

Definitions

Tree diagram with nodes $d_1$, $d_2$, $d_3$, $d_4$, $d_5$.

$A_{d_2} = \{a_1\}$
$A_{d_4} = \{a_1, a_2\}$

$A_d \subseteq A$: the subset of attributes of $T_A$ that were already selected as splitting attributes on the path from the root of the tree to node $d$.
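A short sketch of how the set of already-used splitting attributes could be collected by walking from a node up to the root. The `Node` class, the `path_attributes` helper and the tree shape (d1 is the root and splits on a1, d2 is its child and splits on a2, d4 is a child of d2) are assumptions chosen to agree with the two example sets on this slide; the diagram itself is not recoverable from the transcript.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    split_attribute: Optional[str] = None  # attribute this node splits on, if any

def path_attributes(node: Node) -> set:
    """Splitting attributes used at the ancestors of `node` (root to parent)."""
    used = set()
    current = node.parent
    while current is not None:
        if current.split_attribute is not None:
            used.add(current.split_attribute)
        current = current.parent
    return used

# Assumed tree shape (hypothetical, see the lead-in).
d1 = Node("d1", split_attribute="a1")
d2 = Node("d2", parent=d1, split_attribute="a2")
d4 = Node("d4", parent=d2)

print(sorted(path_attributes(d2)))  # ['a1']
print(sorted(path_attributes(d4)))  # ['a1', 'a2']
```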

Page 10:

Running Examples

Page 11:

The data set

Customer Type | Customer City | Request Location | Request Day Of Week | Request Part Of Day | Request ID
private | Berlin | Berlin | Friday | Afternoon | 1
private | Hamburg | Hamburg | Wednesday | Afternoon | 2
business | Berlin | Berlin | Wednesday | Morning | 3
private | Berlin | Berlin | Wednesday | Morning | 4
private | Berlin | Berlin | Saturday | Afternoon | 5
private | Berlin | Berlin | Thursday | Morning | 6
private | Berlin | Berlin | Friday | Afternoon | 7
business | Berlin | Berlin | Saturday | Afternoon | 8
private | Berlin | Berlin | Saturday | Afternoon | 9
business | Hamburg | Hamburg | Friday | Afternoon | 10
business | Hamburg | Hamburg | Monday | Afternoon | 11
private | Hamburg | Hamburg | Saturday | Afternoon | 12
private | Berlin | Berlin | Monday | Afternoon | 13
private | Bonn | Berlin | Monday | Afternoon | 14
private | Berlin | Berlin | Monday | Afternoon | 15
private | Bonn | Bonn | Saturday | Morning | 16
private | Hamburg | Hamburg | Saturday | Morning | 17
private | Hamburg | Hamburg | Saturday | Morning | 18
private | Hamburg | Hamburg | Friday | Afternoon | 19

Page 12:

The data set – cont.

Customer Type | Customer City | Request Location | Request Day Of Week | Request Part Of Day | Request ID
private | Bonn | Hamburg | Friday | Afternoon | 20
private | Berlin | Hamburg | Friday | Morning | 21
business | Berlin | Berlin | Friday | Morning | 22
private | Berlin | Berlin | Friday | Morning | 23
private | Berlin | Berlin | Wednesday | Afternoon | 24
private | Berlin | Berlin | Thursday | Afternoon | 25
business | Berlin | Berlin | Thursday | Afternoon | 26
business | Bonn | Bonn | Monday | Afternoon | 27
private | Hamburg | Bonn | Monday | Afternoon | 28
business | Berlin | Bonn | Monday | Afternoon | 29
business | Bonn | Bonn | Wednesday | Afternoon | 30
private | Bonn | Bonn | Friday | Afternoon | 31
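As a sanity check on the 31 records, the sketch below counts how often each Request Location value occurs. The resulting relative frequencies agree with the weights W1 = 16/31, W2 = 9/31 and W3 = 6/31 that appear in the Request Location split later in the deck; reading those weights as relative subset sizes is an interpretation, since the slides do not define W explicitly.

```python
from collections import Counter
from fractions import Fraction

# Request Location column of the data set, in Request ID order (1 to 31).
request_locations = (
    "Berlin Hamburg Berlin Berlin Berlin Berlin Berlin Berlin Berlin Hamburg "  # IDs 1-10
    "Hamburg Hamburg Berlin Berlin Berlin Bonn Hamburg Hamburg Hamburg "        # IDs 11-19
    "Hamburg Hamburg Berlin Berlin Berlin Berlin Berlin "                       # IDs 20-26
    "Bonn Bonn Bonn Bonn Bonn"                                                  # IDs 27-31
).split()

counts = Counter(request_locations)
total = len(request_locations)

# Relative size of each subset produced by splitting on Request Location.
weights = {value: Fraction(n, total) for value, n in counts.items()}

for value, weight in weights.items():
    print(value, weight)
```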

Page 13:

Coarse Grained Jaccard

Page 14:

Coarse Grained Jaccard – Splitting the root of the tree

Three candidates for split:
• Request location
• Request day of week
• Request part of day
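The slides do not spell out the similarity formula, so the following is only a sketch of the standard Jaccard coefficient (size of the intersection over size of the union) applied to whole records. Treating two records as equal only when every attribute agrees is an assumed reading of "coarse grained", the input tuples are hypothetical, and the snippet is not claimed to reproduce the scores reported in this deck.

```python
def jaccard(left, right):
    """|left ∩ right| / |left ∪ right| for two collections of hashable records."""
    left, right = set(left), set(right)
    union = left | right
    return len(left & right) / len(union) if union else 0.0

# Hypothetical (Customer Type, Customer City) tuples on either side of a split.
subset = [("private", "Berlin"), ("business", "Berlin")]
rest = [("private", "Berlin"), ("private", "Bonn"), ("business", "Hamburg")]

print(jaccard(subset, rest))  # 0.25
```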

Page 15:

CGJ – Splitting the root of the tree

Candidate split of node d on reqLocation (each value against the rest):
• reqLocation = Berlin vs. reqLocation != Berlin: W1 = 16/31, Score1 = 1/23
• reqLocation = Hamburg vs. reqLocation != Hamburg: W2 = 9/31, Score2 = 2/23
• reqLocation = Bonn vs. reqLocation != Bonn: W3 = 6/31, Score3 = 1/23

Score(Split_reqLocation) = W1*Score1 + W2*Score2 + W3*Score3 = 0.0561
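The arithmetic behind the reqLocation score can be checked with exact fractions. The weights and per-subset scores are the ones given on this slide; the function name `weighted_split_score` is hypothetical.

```python
from fractions import Fraction

def weighted_split_score(parts):
    """Sum of weight * score over the subsets produced by a candidate split."""
    return sum(weight * score for weight, score in parts)

# (W_i, Score_i) for reqLocation = Berlin, Hamburg, Bonn, as given on the slide.
req_location = [
    (Fraction(16, 31), Fraction(1, 23)),
    (Fraction(9, 31), Fraction(2, 23)),
    (Fraction(6, 31), Fraction(1, 23)),
]

score = weighted_split_score(req_location)
print(score, round(float(score), 4))  # 40/713 0.0561
```

The same function applies to any other candidate split once its weights and per-subset scores are known.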

Page 16:

CGJ – Splitting the root of the tree

Candidate split of node d on dayOfWeek (each value against the rest):
• dayOfWeek = Monday vs. dayOfWeek != Monday: W1 = 7/31, Score1 = 3/15
• dayOfWeek = Wednesday vs. dayOfWeek != Wednesday: W2 = 5/31, Score2 = 5/15
• dayOfWeek = Thursday vs. dayOfWeek != Thursday: W3 = 3/31, Score3 = 3/15
• dayOfWeek = Friday vs. dayOfWeek != Friday: W4 = 9/31, Score4 = 5/15
• dayOfWeek = Saturday vs. dayOfWeek != Saturday: W5 = 7/31, Score5 = 3/15

Score(Split_dayOfWeek) = W1*Score1 + W2*Score2 + W3*Score3 + W4*Score4 + W5*Score5 = 0.260

Page 17:

CGJ – Splitting the root of the tree

Candidate split of node d on partOfDay:
• partOfDay = Afternoon vs. partOfDay = Morning: Score1 = 4/23

Score(Split_partOfDay) = 0.173

Page 18:

Coarse Grained Jaccard – Splitting the root of the tree

Three candidates for split:
• Request location: 0.0561
• Request day of week: 0.260
• Request part of day: 0.173

The split in the root
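Choosing among the three candidates can be sketched as a simple minimum over the reported scores. That the lowest similarity score wins is an assumption here: it matches the goal of producing subsets that differ from one another and the Request Location splits drawn on later slides, but this slide does not state the rule explicitly.

```python
# Split scores as reported on the slides.
split_scores = {
    "Request location": 0.0561,
    "Request day of week": 0.260,
    "Request part of day": 0.173,
}

# Assumption: the candidate whose subsets are least similar (lowest score) wins.
root_split = min(split_scores, key=split_scores.get)
print(root_split)  # Request location
```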

Page 19:

Fine Grained Jaccard

Page 20:

Fine Grained Jaccard – Splitting the root of the tree

Node d: Req. Location = Berlin vs. Req. Location != Berlin

Page 21:

Least Probable Intersections

Page 22:

LPI – Splitting the root of the tree

Node d: Req. Location = Berlin vs. Req. Location != Berlin

Page 23:

[Table: the 31 records of the data set (Customer Type, Customer City, Request Location, Request Day Of Week, Request Part Of Day; the Request ID column is empty), shown with the split Req. Location = Berlin vs. Req. Location != Berlin]


Page 25:

Maximum Likelihood Estimation

Page 26:

MLE – Splitting the root of the tree

Tree diagram: the root splits on Request Location into three branches (Berlin, Bonn, Hamburg). Under each branch: Cust. City and Cust. Type.

• p(Cust. City | Cust. Type)
• p(Cust. Type | Cust. City)
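The two conditional probabilities named on this slide can be estimated by relative frequencies, which is the maximum likelihood estimate for a categorical distribution. The sketch below does this for the six records of the data set whose Request Location is Bonn. The helper name `conditional` is hypothetical, and whether the deck estimates its models in exactly this way is an assumption.

```python
from collections import Counter
from fractions import Fraction

# (Customer Type, Customer City) of the six records with Request Location = Bonn
# (Request IDs 16, 27, 28, 29, 30, 31 in the data set).
bonn_leaf = [
    ("private", "Bonn"), ("business", "Bonn"), ("private", "Hamburg"),
    ("business", "Berlin"), ("business", "Bonn"), ("private", "Bonn"),
]

def conditional(pairs, given_index):
    """Relative-frequency estimate of p(other attribute | attribute at given_index)."""
    joint = Counter(pairs)
    marginal = Counter(pair[given_index] for pair in pairs)
    return {pair: Fraction(n, marginal[pair[given_index]]) for pair, n in joint.items()}

p_city_given_type = conditional(bonn_leaf, given_index=0)
p_type_given_city = conditional(bonn_leaf, given_index=1)

print(p_city_given_type[("private", "Bonn")])   # 2/3
print(p_type_given_city[("business", "Bonn")])  # 1/2
```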