Identifying Objects Using Cluster and Concept Analysis
![Page 1: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/1.jpg)
Identifying Objects Using Cluster and Concept Analysis
Arie van Deursen, Tobias Kuipers
CWI, The Netherlands
![Page 2: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/2.jpg)
Motivation
• Legacy code incomprehensible
  – Lack of structure
• Case: >100,000 LOC Banking System
  – Cobol + VSAM data files
• Customer wanted OO redesign
• Data central to the system
![Page 3: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/3.jpg)
General Plan
• Find interesting data
  – Data selection
  – Candidate attributes
• Find interesting functionality
  – Program selection (procedure)
  – Candidate methods
• Combine the two
  – Candidate classes
![Page 4: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/4.jpg)
Input Selection
• Domain related v. Implementation specific
• Persistent data stores
  – Only records written to/read from file
  – Refine by CRUD (Create/Read/Update/Delete)
  – Records too big for one class
• Analysis of Program Call Graph
  – high fan-out: control-programs
  – high fan-in: low-level technical
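
The call-graph filter above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the program names, the `(caller, callee)` edge list, and both thresholds are hypothetical. Programs with high fan-out are treated as control programs, programs with high fan-in as low-level technical routines, and what remains is kept as candidate functionality.

```python
from collections import defaultdict

# Hypothetical call graph: (caller, callee) pairs. Names are invented.
CALLS = [
    ("MAIN", "UPDCUST"), ("MAIN", "UPDMORT"), ("MAIN", "PRTRPT"),
    ("UPDCUST", "DATEFMT"), ("UPDMORT", "DATEFMT"), ("PRTRPT", "DATEFMT"),
]

def fan_metrics(calls):
    """Return {program: (fan_in, fan_out)} for a list of call edges."""
    callees = defaultdict(set)   # program -> programs it calls
    callers = defaultdict(set)   # program -> programs that call it
    for caller, callee in calls:
        callees[caller].add(callee)
        callers[callee].add(caller)
    programs = set(callees) | set(callers)
    return {p: (len(callers[p]), len(callees[p])) for p in programs}

def select_programs(calls, max_fan_in=2, max_fan_out=2):
    """Keep programs that are neither control programs (high fan-out)
    nor low-level technical routines (high fan-in).
    Both thresholds are assumptions chosen for this toy graph."""
    metrics = fan_metrics(calls)
    return sorted(p for p, (fan_in, fan_out) in metrics.items()
                  if fan_in <= max_fan_in and fan_out <= max_fan_out)

print(select_programs(CALLS))
```

In this toy graph MAIN is dropped for its fan-out of 3 and DATEFMT for its fan-in of 3. On a real system the thresholds would have to be chosen by inspecting the distribution of both metrics.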
![Page 5: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/5.jpg)
Combining Data & Functionality
• Cluster analysis -- technique for finding groups in data
  – Relies on metrics to compare distance between data items
• Concept analysis -- for finding groups too
  – Relies on maximal subsets of data items sharing a set of features
![Page 6: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/6.jpg)
Cluster Analysis
• Calculate a distance (similarity) measure between all pairs of data items (record fields)
• Use clustering to find hierarchy

Field Name   P1  P2  P3  P4
NAME          1   0   0   0
TITLE         1   0   0   0
INITIAL       1   0   0   0
PREFIX        1   0   0   0
NUMBER        0   0   0   1
NUMBER-EXT    0   0   0   1
ZIPCODE       0   0   0   1
STREET        0   0   1   1
CITY          0   1   0   1
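
The two steps above can be sketched on the table's data. Several things here are assumptions, since the slides do not name them: P1 to P4 are read as programs that use each field, the distance is Euclidean, and clusters are merged with single linkage (the distance between two clusters is the smallest distance between their members). The function names are hypothetical.

```python
import math
from itertools import combinations

# Field usage table from the slide: one 0/1 vector per record field.
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}

def distance(a, b):
    """Euclidean distance between two usage vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, table):
    """Single linkage (assumed): closest pair of members decides."""
    return min(distance(table[f], table[g]) for f in c1 for g in c2)

def agglomerate(table):
    """Repeatedly merge the two closest clusters.
    Returns the merge history as (distance, merged_cluster) pairs,
    which is the information a dendrogram draws."""
    clusters = [frozenset([field]) for field in table]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], table))
        d = cluster_distance(a, b, table)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((d, a | b))
    return merges

for d, cluster in agglomerate(USAGE):
    print(f"{d:.2f}  {sorted(cluster)}")
```

Under these assumptions the fields with identical rows merge at distance 0, STREET and CITY each join the NUMBER group at distance 1, and the two remaining groups merge last. Which of several equally close pairs merges first is arbitrary, which is one reason a single dendrogram does not show every possible grouping.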
![Page 7: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/7.jpg)
Dendrogram
[Figure: dendrogram over a distance axis from 0 to 1, shown beside the field usage table from the Cluster Analysis slide]
Leaves shown: Name, Title, Initial, Prefix
![Page 8: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/8.jpg)
Dendrogram
[Figure: dendrogram over a distance axis from 0 to 1, shown beside the field usage table from the Cluster Analysis slide]
Leaves shown: Name, Title, Initial, Prefix; Number, Nb-Ext, Zipcode
![Page 9: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/9.jpg)
Dendrogram
[Figure: dendrogram over a distance axis from 0 to 1, shown beside the field usage table from the Cluster Analysis slide]
Leaves shown: Name, Title, Initial, Prefix; Number, Nb-Ext, Zipcode
Annotation: Distance is 1
![Page 10: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/10.jpg)
Dendrogram
[Figure: dendrogram over a distance axis from 0 to 1, shown beside the field usage table from the Cluster Analysis slide]
Leaves shown: Name, Title, Initial, Prefix; Number, Nb-Ext, Zipcode; City
Annotation: Distance is 1
![Page 11: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/11.jpg)
Dendrogram
[Figure: dendrogram over a distance axis from 0 to 1, shown beside the field usage table from the Cluster Analysis slide]
Leaves shown: Name, Title, Initial, Prefix; Number, Nb-Ext, Zipcode; City; Street
![Page 12: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/12.jpg)
Dendrogram
[Figure: dendrogram over a distance axis from 0 to 1, shown beside the field usage table from the Cluster Analysis slide]
Leaves shown: Name, Title, Initial, Prefix; Number, Nb-Ext, Zipcode; City; Street
![Page 13: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/13.jpg)
Dendrogram
[Figure: dendrogram over a distance axis from 0 to 1]
Leaves shown: Name, Title, Initial, Prefix; Number, Nb-Ext, Zipcode; City; Street
![Page 14: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/14.jpg)
Dendrogram from Real Data
[Figure: dendrogram over a distance axis from 0 to 2; leaf labels follow as extracted]
AmountAccountOfficeName
BankCityIntAccountOfficeType
PaymentKindRelationNr
ChangeDate
TitleCdPrefixInitial
ZipCdCountyCd
StreetNr
MortSeqNrMortNr
CityStreet
Name
![Page 15: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/15.jpg)
Concept Analysis
• Relies on maximal subsets of data items sharing a set of features
• Concept analysis finds a lattice

Field Name   P1  P2  P3  P4
NAME          x
TITLE         x
INITIAL       x
PREFIX        x
NUMBER                    x
NUMBER-EXT                x
ZIPCODE                   x
STREET                x   x
CITY              x       x
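
A minimal sketch of how the concepts behind the lattice can be enumerated from the table. A concept pairs a maximal set of items with the set of features they all share. The approach below is a naive enumeration over every feature subset, chosen for readability and not taken from the slides; it is exponential in the number of features, and the function names are hypothetical.

```python
from itertools import combinations

# Context from the slide: field name -> set of features marked "x".
CONTEXT = {
    "NAME": {"P1"}, "TITLE": {"P1"}, "INITIAL": {"P1"}, "PREFIX": {"P1"},
    "NUMBER": {"P4"}, "NUMBER-EXT": {"P4"}, "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"},
    "CITY": {"P2", "P4"},
}

def common_items(features, context):
    """All items that have every feature in `features`."""
    return frozenset(item for item, fs in context.items() if features <= fs)

def common_features(items, context, all_features):
    """All features shared by every item in `items`
    (all features when `items` is empty)."""
    shared = set(all_features)
    for item in items:
        shared &= context[item]
    return frozenset(shared)

def concepts(context):
    """Naive enumeration: close every feature subset into a concept,
    returned as a set of (items, features) pairs."""
    all_features = frozenset().union(*context.values())
    found = set()
    for size in range(len(all_features) + 1):
        for subset in combinations(sorted(all_features), size):
            items = common_items(frozenset(subset), context)
            features = common_features(items, context, all_features)
            found.add((items, features))
    return found

for items, features in sorted(concepts(CONTEXT), key=lambda c: -len(c[0])):
    print(sorted(features), sorted(items))
```

For this table the enumeration yields six concepts: the top (all fields, no shared feature), one for P1, one for P4, one each for {P3, P4} (STREET) and {P2, P4} (CITY), and the bottom (no fields, all features). STREET and CITY appear both in the P4 concept and in their own, which is how an item can belong to more than one group.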
![Page 16: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/16.jpg)
Concept Lattice
[Figure: concept lattice, shown beside the field usage table from the Concept Analysis slide. Each node pairs a set of features with a set of items (field names).]
top: All Variables
bottom: P1 P2 P3 P4
![Page 17: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/17.jpg)
Concept Lattice
[Figure: concept lattice, shown beside the field usage table from the Concept Analysis slide]
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
bottom: P1 P2 P3 P4
![Page 18: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/18.jpg)
Concept Lattice
[Figure: concept lattice, shown beside the field usage table from the Concept Analysis slide]
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
P3 P4: Street
P2 P4: City
bottom: P1 P2 P3 P4
![Page 19: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/19.jpg)
Concept Lattice
[Figure: concept lattice]
top: All Variables
P1: Name, Title, Initial, Prefix
P4: Number, Nb-Ext, Zipcode, Street, City
P3 P4: Street
P2 P4: City
bottom: P1 P2 P3 P4
![Page 20: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/20.jpg)
Real Concept Lattice
[Figure: concept lattice built from real data; node labels in the figure run A to X and 1 to 14]
![Page 21: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/21.jpg)
Concluding Remarks
• Variable Selection - Input filtering
• Records are natural starting point in data-intensive applications
  – Legacy/Cobol domain
• Records are too big: Decompose them
• Cluster analysis v. Concept analysis
![Page 22: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/22.jpg)
Cluster v Concept Analysis
• Multiple partitionings
  – Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
  – Origin of cluster decision is lost
• Concept more efficient computationally
• Clustering needs more filtering
![Page 23: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/23.jpg)
Questions
![Page 24: Identifying Objects Using Cluster and Concept Analysis](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816804550346895ddd857a/html5/thumbnails/24.jpg)
Current Approaches
• Subsystem classification techniques
  – Survey, Lakhotia '97. Don't work for Cobol, Cimitile '99
• Record as data part of a class
  – Newcomb & Kotik ('95) take level 01 records, Fergen et al ('94) compare structure of records for reuse
• Manual Methodology
  – Sneed ('92) provides manual methodology for migration of code, Sneed & Nyári ('95) derive 'OO' documentation from legacy.