Identifying Objects Using Cluster and Concept Analysis

Identifying Objects Using Cluster and Concept Analysis Arie van Deursen Tobias Kuipers CWI, The Netherlands

Transcript of Identifying Objects Using Cluster and Concept Analysis

Page 1: Identifying Objects  Using Cluster and Concept Analysis

Identifying Objects Using Cluster and Concept Analysis

Arie van Deursen, Tobias Kuipers

CWI, The Netherlands

Page 2: Identifying Objects  Using Cluster and Concept Analysis

Motivation

• Legacy code is incomprehensible
  – Lack of structure
• Case: banking system of more than 100,000 LOC
  – Cobol + VSAM data files
• Customer wanted an OO redesign
• Data is central to the system

Page 3: Identifying Objects  Using Cluster and Concept Analysis

General Plan

• Find interesting data
  – Data selection
  – Candidate attributes
• Find interesting functionality
  – Program selection (procedure)
  – Candidate methods
• Combine the two
  – Candidate classes

Page 4: Identifying Objects  Using Cluster and Concept Analysis

Input Selection

• Domain related v. implementation specific
• Persistent data stores
  – Only records written to/read from file
  – Refine by CRUD (Create/Read/Update/Delete)
  – Records are too big for one class
• Analysis of program call graph
  – High fan-out: control programs
  – High fan-in: low-level technical programs
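The call graph filter can be sketched as follows. This is a minimal illustration, not the authors' tooling: the program names, the edge list and the two thresholds are all hypothetical, and the slides do not say which cut-off values were used.

```python
from collections import defaultdict

def fan_in_out(calls):
    """Return {program: (fan_in, fan_out)} for a list of (caller, callee) edges."""
    callers = defaultdict(set)  # callee -> programs that call it
    callees = defaultdict(set)  # caller -> programs it calls
    for caller, callee in calls:
        callees[caller].add(callee)
        callers[callee].add(caller)
    programs = set(callers) | set(callees)
    return {p: (len(callers[p]), len(callees[p])) for p in sorted(programs)}

def select_candidates(calls, max_fan_in=2, max_fan_out=2):
    """Drop likely control programs (high fan-out) and utilities (high fan-in).

    The thresholds are assumptions chosen for this toy example.
    """
    return [p for p, (fan_in, fan_out) in fan_in_out(calls).items()
            if fan_in <= max_fan_in and fan_out <= max_fan_out]

# Hypothetical call graph: MAIN drives everything, DATEUTIL is called by everyone.
CALLS = [
    ("MAIN", "CUSTUPD"), ("MAIN", "MORTCALC"), ("MAIN", "REPORT"),
    ("CUSTUPD", "DATEUTIL"), ("MORTCALC", "DATEUTIL"), ("REPORT", "DATEUTIL"),
]

if __name__ == "__main__":
    for program, (fan_in, fan_out) in fan_in_out(CALLS).items():
        print(f"{program:9s} fan-in={fan_in} fan-out={fan_out}")
    print("candidates:", select_candidates(CALLS))
```

In this toy graph MAIN is filtered out for its fan-out and DATEUTIL for its fan-in, leaving the three programs in between as candidates for further analysis.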

Page 5: Identifying Objects  Using Cluster and Concept Analysis

Combining Data & Functionality

• Cluster analysis: a technique for finding groups in data
  – Relies on metrics to compare distance between data items
• Concept analysis: for finding groups too
  – Relies on maximal subsets of data items sharing a set of features
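Both techniques start from the same kind of input: a table of data items against features. A minimal sketch of building such a table, assuming the items are record fields and the features are the programs that use them; the `USES` pairs are a hypothetical extraction result, not data from the case study.

```python
def usage_table(uses):
    """Build {field: {program: 0 or 1}} from (program, field) usage pairs."""
    programs = sorted({program for program, _ in uses})
    fields = sorted({field for _, field in uses})
    used = set(uses)
    return {f: {p: int((p, f) in used) for p in programs} for f in fields}

# Hypothetical (program, field) pairs, e.g. produced by scanning the sources.
USES = [
    ("P1", "NAME"), ("P1", "TITLE"),
    ("P2", "CITY"), ("P3", "STREET"),
    ("P4", "CITY"), ("P4", "STREET"), ("P4", "ZIPCODE"),
]

if __name__ == "__main__":
    table = usage_table(USES)
    print("Field    " + " ".join(sorted({p for p, _ in USES})))
    for field, row in table.items():
        print(f"{field:8s} " + "  ".join(str(v) for v in row.values()))
```

Cluster analysis reads each row as a vector to measure distances; concept analysis reads the same row as the set of features the item has.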

Page 6: Identifying Objects  Using Cluster and Concept Analysis

Cluster Analysis

• Calculate distance (similarity) number between all data items (record fields)
• Use clustering to find hierarchy

Field Name   P1  P2  P3  P4
NAME          1   0   0   0
TITLE         1   0   0   0
INITIAL       1   0   0   0
PREFIX        1   0   0   0
NUMBER        0   0   0   1
NUMBER-EXT    0   0   0   1
ZIPCODE       0   0   0   1
STREET        0   0   1   1
CITY          0   1   0   1
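A minimal sketch of the two steps on the table above. Two things are assumptions, since the slides name neither: the metric (Euclidean distance between rows) and the linkage rule (single linkage, i.e. the distance between two clusters is that of their closest members). Ties are broken by iteration order.

```python
from itertools import combinations
from math import dist

# Rows of the usage table: field -> (P1, P2, P3, P4).
TABLE = {
    "NAME": (1, 0, 0, 0), "TITLE": (1, 0, 0, 0),
    "INITIAL": (1, 0, 0, 0), "PREFIX": (1, 0, 0, 0),
    "NUMBER": (0, 0, 0, 1), "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE": (0, 0, 0, 1), "STREET": (0, 0, 1, 1), "CITY": (0, 1, 0, 1),
}

def cluster_distance(a, b, table):
    """Single linkage (assumed): smallest distance between any two members."""
    return min(dist(table[x], table[y]) for x in a for y in b)

def agglomerate(table):
    """Merge the two closest clusters until one remains.

    Returns the merge steps as (distance, merged_cluster) tuples, which is
    the information a dendrogram draws.
    """
    clusters = [frozenset([name]) for name in table]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], table))
        d = cluster_distance(a, b, table)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((d, a | b))
    return merges

if __name__ == "__main__":
    for d, cluster in agglomerate(TABLE):
        print(f"{d:.3f}  {sorted(cluster)}")
```

Fields with identical rows merge at distance 0. STREET and CITY are both at distance 1 from the NUMBER group, so which of them joins first depends only on tie-breaking.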

Page 7: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram on a distance axis from 0 to 1. Leaf group shown: (Name, Title, Initial, Prefix).]

Page 8: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram on a distance axis from 0 to 1. Leaf groups shown: (Name, Title, Initial, Prefix) and (Number, Nb-Ext, Zipcode).]

Page 9: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram on a distance axis from 0 to 1. Leaf groups shown: (Name, Title, Initial, Prefix) and (Number, Nb-Ext, Zipcode). Annotation: "Distance is 1".]

Page 10: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram on a distance axis from 0 to 1. Leaf groups shown: (Name, Title, Initial, Prefix), (Number, Nb-Ext, Zipcode) and City. Annotation: "Distance is 1".]
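As a worked check of the "Distance is 1" annotation, assuming plain Euclidean distance over the rows of the usage table (the slides do not name the metric), the CITY row (0, 1, 0, 1) and the NUMBER row (0, 0, 0, 1) give:

```latex
d(\text{CITY}, \text{NUMBER})
  = \sqrt{(0-0)^2 + (1-0)^2 + (0-0)^2 + (1-1)^2} = 1
\qquad
d(\text{CITY}, \text{STREET})
  = \sqrt{(0-0)^2 + (1-0)^2 + (0-1)^2 + (1-1)^2} = \sqrt{2}
```

Under that assumption City is closer to the Number group than to Street, even though City and Street are both used by P4.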

Page 11: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram on a distance axis from 0 to 1. Leaf groups shown: (Name, Title, Initial, Prefix), (Number, Nb-Ext, Zipcode), City and Street.]

Page 12: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram on a distance axis from 0 to 1. Leaf groups shown: (Name, Title, Initial, Prefix), (Number, Nb-Ext, Zipcode), City and Street.]

Page 13: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram

[Figure: dendrogram on a distance axis from 0 to 1. Leaf groups shown: (Name, Title, Initial, Prefix), (Number, Nb-Ext, Zipcode), City and Street.]

Page 14: Identifying Objects  Using Cluster and Concept Analysis

Dendrogram from Real Data

[Figure: dendrogram on a distance axis from 0 to 2. Leaf labels follow, grouped as in the figure.]

AmountAccountOfficeName

BankCityIntAccountOfficeType

PaymentKindRelationNr

ChangeDate

TitleCdPrefixInitial

ZipCdCountyCd

StreetNr

MortSeqNrMortNr

CityStreet

Name

Page 15: Identifying Objects  Using Cluster and Concept Analysis

Concept Analysis

• Relies on maximal subsets of data items sharing a set of features
• Concept analysis finds a lattice

Field Name   P1  P2  P3  P4
NAME          x
TITLE         x
INITIAL       x
PREFIX        x
NUMBER                    x
NUMBER-EXT                x
ZIPCODE                   x
STREET                x   x
CITY              x       x
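A minimal sketch of how the concepts (the nodes of the lattice) can be computed from the table above. It is a brute-force enumeration over all feature subsets, chosen for readability; it is not presented as the algorithm used in the study and only suits small feature sets.

```python
from itertools import combinations

# The table above as a formal context: item (field) -> features (programs).
CONTEXT = {
    "NAME": {"P1"}, "TITLE": {"P1"}, "INITIAL": {"P1"}, "PREFIX": {"P1"},
    "NUMBER": {"P4"}, "NUMBER-EXT": {"P4"}, "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"}, "CITY": {"P2", "P4"},
}

def extent(features, context):
    """Items that have every feature in `features`."""
    return frozenset(i for i, fs in context.items() if features <= fs)

def intent(items, context):
    """Features shared by every item in `items` (all features if empty)."""
    shared = set().union(*context.values())
    for item in items:
        shared &= context[item]
    return frozenset(shared)

def concepts(context):
    """Return every (items, features) pair where each set determines the other."""
    all_features = sorted(set().union(*context.values()))
    found = set()
    for size in range(len(all_features) + 1):
        for subset in combinations(all_features, size):
            items = extent(frozenset(subset), context)
            found.add((items, intent(items, context)))
    return found

if __name__ == "__main__":
    for items, features in sorted(concepts(CONTEXT), key=lambda c: -len(c[0])):
        print(sorted(features), "->", sorted(items))
```

On this table the enumeration yields six concepts: the top (all fields, no shared program), P1, P4, (Street; P3 P4), (City; P2 P4) and the bottom (no field, all programs). Street and City each appear in two concepts, their own and P4.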

Page 16: Identifying Objects  Using Cluster and Concept Analysis

Concept Lattice

[Figure: lattice with two nodes. The top node is labelled "All Variables"; the bottom node is labelled "P1 P2 P3 P4". Legend: each node carries a set of features and a set of items (field names).]

Page 17: Identifying Objects  Using Cluster and Concept Analysis

Concept Lattice

[Figure: lattice nodes.
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
bottom: P1 P2 P3 P4]

Page 18: Identifying Objects  Using Cluster and Concept Analysis

Concept Lattice

[Figure: lattice nodes.
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
P3 P4: Street
P2 P4: City
bottom: P1 P2 P3 P4]

Page 19: Identifying Objects  Using Cluster and Concept Analysis

Concept Lattice

[Figure: lattice nodes.
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
P3 P4: Street
P2 P4: City
bottom: P1 P2 P3 P4]

Page 20: Identifying Objects  Using Cluster and Concept Analysis

Real Concept Lattice

[Figure: concept lattice from real data. Labels in the figure: A to X and 1 to 14.]

Page 21: Identifying Objects  Using Cluster and Concept Analysis

Concluding Remarks

• Variable selection: input filtering
• Records are a natural starting point in data-intensive applications
  – Legacy/Cobol domain
• Records are too big: decompose them
• Cluster analysis v. concept analysis

Page 22: Identifying Objects  Using Cluster and Concept Analysis

Cluster v Concept Analysis

• Multiple partitionings
  – Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
  – Origin of cluster decision is lost
• Concept analysis is more efficient computationally
• Clustering needs more filtering

Page 23: Identifying Objects  Using Cluster and Concept Analysis

Questions

Page 24: Identifying Objects  Using Cluster and Concept Analysis

Current Approaches

• Subsystem classification techniques
  – Survey: Lakhotia '97. Don't work for Cobol: Cimitile '99
• Record as data part of a class
  – Newcomb & Kotik ('95) take level 01 records; Fergen et al. ('94) compare structure of records for reuse
• Manual methodology
  – Sneed ('92) provides a manual methodology for migration of code; Sneed & Nyári ('95) derive 'OO' documentation from legacy.