Geographically-Typed Geospatial Data Source Matching with High-Quality Clustering and Multi-Attribute Matching
Jeffrey Partyka, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
Funded by NGA & US Air Force
Topic Outline
• Problem Statement
• Background Information
• Matching Procedures
  - Generalized Solution
  - N-grams
  - Non-Geographic Matching (NGT Matching)
  - Geographic Matching (GT Matching)
  - Attribute Weighting
  - High-Quality Clustering
  - 1:N Matching
• Experimental Results
• Future Work
Motivation
• Internet Architecture
  ▫ Highly Distributed
  ▫ Federated Architecture
• Web Application Problems
  ▫ Low Performance for Information Retrieval
  ▫ Accuracy of Retrieved Information
Sample Scenario
Query: Publication of Academic Staff
Task: Rank Data Source
Different Bibliography Ontologies: MIT Ontology, Karlsruhe Ontology, UMBC Ontology
{Article, Book, Booklet, InBook, InCollection, InProceedings, Manual, Misc, Proceedings, Report, Technical Report, Project Report, Thesis, Master Thesis, PhD Thesis, Unpublished, Faculty Member, Lecturer}
Problem Statement: Schema Matching
Given 2 data sources, S1 and S2, each of which is composed of a set of tables where {T11, T12, T13…T1k…T1m} ∈ S1 and {T21, T22, T23…T2j…T2n} ∈ S2, with 1 <= k <= m and 1 <= j <= n, determine the similarity between T1k and T2j
S1:
roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson

S2:
Road            County
Custer Pwy      Cooke
15th St.        Collin
Parker Rd.      Collin
Alma Dr.        Collin

A second pair of example tables:
COUNTY          Destination
SNOHOMISH       Mukilteo
PIERCE          Point Defiance
KITSAP          Southworth
SNOHOMISH       Edmonds

City            County
Anacortes       Skagit
Friday Harbor   San Juan
Argyle          San Juan
Kirkland        King
Road Road
Problem Statement: Ontology Matching
Given 2 ontologies, O1 and O2, each of which is composed of a set of concepts where {C11, C12, C13…C1k…C1m} ∈ O1 and {C21, C22, C23…C2j…C2n} ∈ O2, with 1 <= k <= m and 1 <= j <= n, determine the similarity between C1k and C2j
Motivating Scenarios
1. Making Complex Business Decisions
“Should we invest in a new cholesterol drug for the Asia-Pacific region?”
Groups involved: R & D, Corporate, Marketing, Regulatory Affairs, Manufacturing
Answer: Yes/No/Maybe?
2. Robust Semantic Web Applications
“Find the group of friends around Jeff. Then find the most important person out of the group. Find out if this person was at an event of type Meeting, and happened between 9AM-11AM within 5 miles of UTD”
Query components: Jeff, Jeff’s friends; Within 5 miles of UTD; 9:00am-11:00am; Event of Type ‘Meeting’
Resources: Social Network, Geospatial Ontology, Temporal Logic, RDFS Lookup
Answer: Yes/No/Maybe?
Matching Approaches
Mappings may be generated in several ways. Some approaches are:
(1) Name Matching
(2) Structure Matching
(3) Instance Matching

Example attribute names: Email, emailAddress

Example tables:
County       DSP
Kitsap       Kingston
Wahkiak      Puget Island

COUNTYNAME       CID
TRAIL RANGE DR   96
KITSAP           97
Some Definitions
Definition 1 (attribute) An attribute of a table T, denoted as att(T), is defined as a property of T that further describes it.
Definition 2 (instance) An instance x of an attribute att(T) is defined as a data value associated with att(T).
Definition 3 (keyword) A keyword k of an instance x associated with attribute att(T) is defined as a meaningful word (not a stopword) representing a portion of the instance.
Some Definitions (cont)
Definition 4a (geographic type (GT)) A geographic type GT associated with attribute att(T) is defined as a class of instances of att(T) that represent the same geographic feature (e.g., “lake”, “road”).
Definition 4b (non-geographic type (NGT)) A non-geographic type (NGT) associated with attribute att(T) is defined as a group of keywords from instances of att(T) that are semantically related to each other.
Example instances: Collin, Plano, Richardson, New Jersey, Trenton, Monmouth
Overview of Matching Algorithm
1. Select attribute pairs for comparison (e.g., roadName, roadType, city compared against town, rType, rName, county)
2. Match instances between compared attributes (e.g., roadName: K Ave., Jupiter Rd., Coit Rd. against rName: L Ave., LBJ Freeway, US 75)
3. Determine final attribute similarity: run the similarity algorithms (e.g., roadName and rName: EBD = .98)
Determining Semantic Similarity
• We use Entropy-Based Distribution (EBD)
• EBD is a measurement of type similarity between 2 attributes (or columns):
  EBD = H(C|T) / H(C)
• EBD takes values in the range [0,1]. Greater EBD corresponds to more similar type distributions between compared attributes (columns)
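A minimal sketch of the EBD ratio, using the standard Shannon definitions of entropy and conditional entropy. It assumes that C is the attribute (column) each instance came from and T is the type assigned to that instance; this reading, and the function names `entropy`, `conditional_entropy`, and `ebd`, are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(columns, types):
    """H(C|T): entropy of the column label within each type, weighted by type frequency."""
    by_type = {}
    for column, type_ in zip(columns, types):
        by_type.setdefault(type_, []).append(column)
    total = len(columns)
    return sum(len(cols) / total * entropy(cols) for cols in by_type.values())


def ebd(columns, types):
    """EBD = H(C|T) / H(C); returns 0.0 when H(C) is zero (assumed convention)."""
    h_c = entropy(columns)
    return conditional_entropy(columns, types) / h_c if h_c > 0 else 0.0


# Two attributes with six instances each, typed X, Y, or Z.
columns = ["att1"] * 6 + ["att2"] * 6
types = list("XXXYYZ") + list("XXYYYZ")
print(ebd(columns, types))
```

For this input the ratio evaluates to roughly 0.976: each type is spread almost evenly over both attributes, so knowing the type says little about which attribute an instance came from.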
Applying EBD to Semantic Matching
[Figure: instances of att1 (types X, X, X, Y, Y, Z) and att2 (types X, X, Y, Y, Y, Z) are grouped by type; entropy H(C) and conditional entropy H(C|T) are computed over the grouping.]
Matching Using N-grams
• Use commonly occurring N-grams [2,3] in compared attributes to determine similarity (N = 2)
Table TA:
StrName            FENAME         Status
LOCUST-GROVE DR    LOCUST GROVE   BUILT
TRAIL RANGE DR     TRAIL RANGE    BUILT

Table TB:
Street             Laddress   Raddress
LOUISE -DOVER DR   1600       1798
CR45/MANET CT      2500       2598

Some N-grams extracted from A.StrName = {LO, OC, CU, ST, OV, …}
Some N-grams extracted from B.Street = {LO, OU, UI, OV, …}
[Figure: N-grams from both attributes (LO, OV from each; ST; UI) grouped to compute the conditional entropy H(C|T).]
[2] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Content-based ontology matching for GIS datasets. ACM SIGSPATIAL GIS 2008 (ACM GIS, Laguna Beach, California, Nov. 2008): 51.
[3] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Ontology Alignment Using Multiple Contexts. 7th International Semantic Web Conference (ISWC) Karlsruhe, Germany, Oct. 2008.
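The 2-gram extraction on the N-grams slide can be sketched as follows. This is an illustration only: it assumes N-grams are taken within each alphanumeric token (which is consistent with the sample sets LO, OC, CU, ST, OV shown for StrName), and the helper names `token_ngrams` and `column_ngrams` are hypothetical.

```python
import re


def token_ngrams(text, n=2):
    """Set of character n-grams taken inside each alphanumeric token of text."""
    grams = set()
    for token in re.findall(r"[A-Za-z0-9]+", text):
        grams.update(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def column_ngrams(instances, n=2):
    """Union of the n-grams of every instance in an attribute (column)."""
    grams = set()
    for instance in instances:
        grams |= token_ngrams(instance, n)
    return grams


str_name = column_ngrams(["LOCUST-GROVE DR", "TRAIL RANGE DR"])
street = column_ngrams(["LOUISE -DOVER DR", "CR45/MANET CT"])
shared = str_name & street  # N-grams occurring in both attributes
```

The shared N-grams (for example LO and OV) play the role of types when the entropy-based similarity is computed over the two attributes.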
Faults of this Method
• Semantically similar columns are not guaranteed to have a high similarity score

A ∈ T1:
City          Country
Dallas        USA
Houston       USA
Kingston      Jamaica
Halifax       Canada
Mexico City   Mexico

B ∈ T2:
ctyName        country
Shanghai       China
Beijing        China
Tokyo          Japan
New Delhi      India
Kuala Lumpur   Malaysia

2-grams extracted from A: {Da, al, la, as, Ho, ou, us…}
2-grams extracted from B: {Sh, ha, an, ng, gh, ha, ai, Be, ei, ij…}
Non-Geographic Matching
[Figure: keywords from the compared attributes (Dallas, USA, Houston, Tokyo, Beijing, Halifax, New Delhi, China, Jamaica, India, Malaysia) grouped into clusters.]
● Use clustering methods to group keywords of instances together without relying on shared N-grams between instances[4]
● K-means is not suitable because we cannot compute a centroid among string instances, so we use K-medoid clustering
● Use Normalized Google Distance (NGD) as a distance measure between any two keywords in a cluster
● WordNet would not be a suitable distance measure in the GIS domain
[4] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Semantic Schema Matching without Shared Instances. 3rd IEEE International Conference on Semantic Computing (ICSC) Berkeley, California, September 2009: 297-302.
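The clustering step described above can be sketched with a simple alternating K-medoid loop that needs only a pairwise distance function, never a centroid. This is a generic sketch, not the authors' code: the function name `k_medoids`, the random initialization, and the toy distance values (standing in for NGD) are all assumptions made for illustration.

```python
import random


def k_medoids(items, k, distance, max_iter=100, seed=0):
    """Cluster items around k medoids using only pairwise distances."""
    rng = random.Random(seed)
    medoids = rng.sample(items, k)  # random initial medoids
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for item in items:
            nearest = min(medoids, key=lambda m: distance(item, m))
            clusters[nearest].append(item)
        # Update step: the new medoid minimizes total distance to its cluster.
        new_medoids = [
            min(members or [m], key=lambda c: sum(distance(c, o) for o in members))
            for m, members in clusters.items()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters


# Toy stand-in for NGD: small distance within a group, large across groups.
SUFFIXES = {"Rd.", "Dr.", "St.", "Pwy"}


def toy_distance(a, b):
    if a == b:
        return 0.0
    return 0.1 if (a in SUFFIXES) == (b in SUFFIXES) else 0.9


keywords = ["Johnson", "Rd.", "School", "Dr.", "Zeppelin", "St.", "Pwy"]
clusters = k_medoids(keywords, 2, toy_distance)
```

Because a medoid is always one of the keywords themselves, the procedure works for strings where averaging (as K-means requires) is undefined.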
Definition of Google Distance
NGD(x, y)[7] is a measure for the symmetric conditional probability of co-occurrence of x and y
[7] Cilibrasi,R.,Vitányi, P.: The Google Similarity Distance. IEEE Trans. Knowledge and Data Engineering 19, 370--383 (2007)
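For reference, the NGD formula published by Cilibrasi and Vitányi [7] is NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) and f(y) are the numbers of pages containing each term, f(x, y) is the number containing both, and N is the number of indexed pages. A minimal sketch follows; the function name `ngd`, the handling of zero counts, and the counts used in the example are illustrative assumptions, not real search-engine figures.

```python
from math import log


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts fx, fy, fxy and index size n."""
    if fx == 0 or fy == 0 or fxy == 0:
        return float("inf")  # assumed convention: terms that never co-occur are maximally distant
    log_fx, log_fy, log_fxy = log(fx), log(fy), log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (log(n) - min(log_fx, log_fy))


# Made-up counts: two terms that always appear together have distance 0.
print(ngd(1000, 1000, 1000, 10**6))
```

Smaller values indicate keywords that co-occur more often, which is what the K-medoid step uses to place keywords in the same cluster.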
K-medoid + NGD instance similarity
Step 1: Extract distinct keywords from compared attributes
  Keywords extracted from attributes = {Johnson, Rd., School, 15th,…}
Step 2: Group distinct keywords together into semantic clusters
  “Rd.”, “Dr.”, “St.”, “Pwy”,…
  “Johnson”, “School”, “Dr.”,…
Step 3: Calculate Similarity
  Similarity = H(C|T) / H(C)

T1 ∈ O1:
roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst

T2 ∈ O2:
Road            County
Custer Pwy      Collin
15th St.        Collin
Parker Rd.      Collin
Problems with Non-Geographic Matching via NGD + K-medoid
It is possible that two different geographic entities in the same location (e.g., Dallas, TX and Dallas County) will be mistaken for being similar:

roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas
Geographic Type Matching
We use a gazetteer to determine the geographic type (GT) of an instance [5,6]:
[Figure: instances of S1 and instances of S2 (e.g., Anacortes, Edmonds) mapped to GTs; the instances Victoria and Clinton are marked with “?”.]
[5] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geographically-Typed Semantic Schema Matching. In: Divyakant, A., Aref, W., Lu, C.T. et al. (eds.) ACM SIGSPATIAL GIS 2009, Seattle, Washington, pp. 456--459. ACM (Nov. 2009)
[6] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geospatial Schema Matching with High-Quality Clustering and Multi-Attribute Matching. Submitted to the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011, May 2011, Shenzhen, China).
Using Latlong Value to Enhance GT Matching
GSim: Combining NGT and GT Matching
We apply GT matching for an attribute comparison if >= 50% of the instances involved in the comparison have GT information. If this is not the case, then NGT matching is applied instead [1]:
featureName        City
Collin Creek       Plano
White Rock Lake    Dallas
Dallas River       Lakehurst

Lake               County
Cooke Lake         Cooke
Mud Lake           Collin
Stone Briar Lake   Collin

[Figure: decision “>= 50% of instances have a GT?” leading to NGT Matching or GT Matching, with groupings Lake, Creek, River; Rock, Stone, Mud; Cooke Lake, Mud Lake, Stone Briar Lake; Collin Creek.]
[1] Jeffrey Partyka, Pallabi Parveen, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Enhanced Geographically-Typed Semantic Schema Matching. To appear in the Journal of Web Semantics, 2011.
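The 50% rule for switching between GT and NGT matching can be sketched as below. The sketch assumes the gazetteer is available as a plain mapping from instance to GT and that "instances involved in the comparison" means the instances of both compared attributes; the function name `choose_matching_method` and the toy gazetteer entries are hypothetical.

```python
def choose_matching_method(instances_a, instances_b, gazetteer, threshold=0.5):
    """Return "GT" if at least `threshold` of all compared instances have a GT, else "NGT"."""
    instances = list(instances_a) + list(instances_b)
    if not instances:
        return "NGT"  # assumed fallback when there is nothing to look up
    typed = sum(1 for instance in instances if instance in gazetteer)
    return "GT" if typed / len(instances) >= threshold else "NGT"


# Toy gazetteer (hypothetical entries) covering four of the six instances.
gazetteer = {
    "Collin Creek": "Creek",
    "White Rock Lake": "Lake",
    "Cooke Lake": "Lake",
    "Mud Lake": "Lake",
}
feature_name = ["Collin Creek", "White Rock Lake", "Dallas River"]
lake = ["Cooke Lake", "Mud Lake", "Stone Briar Lake"]
method = choose_matching_method(feature_name, lake, gazetteer)
```

With four of six instances typed, the comparison above is routed to GT matching; with fewer than three typed instances it would fall back to NGT matching.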
Attribute Weighting
• We can distribute the weight of each attribute match based on their importance:
strAdd                  city         state   zipCode
1000 Park Blvd.         Plano        TX      75075
209 Spring Valley Rd.   Richardson   TX      75080
1703 Danube Ln.         Plano        TX      75075
18431 Roehampton Dr.    Dallas       TX      75252

Street Address          City         State   Zip
100 Genstar Dr.         Dallas       TX      75252
2091 Spring Creek Rd.   Plano        TX      75075
1704 Danube Ln.         Plano        TX      75075
18331 Roehampton Dr.    Dallas       TX      75252

Attribute match weights shown on the slide: 27%, 23%, 26%, 24%
Measuring Attribute Match Importance
• Attribute Match Importance determined by:
  1. Attribute Uniqueness
  2. Attribute Relevance
[Figure: attributes (name, roadType, city, town, road_type, rName, ctyName, lakeType, type, lakename, destPort, county, edez_id, Dest) of the tables Roads, Ports, Lakes and Roads, Sea Ports, LakeFeatures.]
Attribute Uniqueness
• Determine uniqueness of attributes att1 and att2 involved in a match (att1-att2) by clustering all attributes from all tables over S1 and S2:
[Figure: attribute clusters with cutoff 1 and cutoff 2.]
Attribute Clustering
• Use Intercluster Similarity (ICS) to decide if clusters A and B should merge:
• Calculate cutoff point (CP) to determine when to stop clustering:
[Figure: Cutoff Point vs. # of Cluster Iterations]
Calculating AU, corrected EBD value
• Calculate AU for an attribute att in a match:
  AUatt ∈ [0,1]
• Calculate pairwise uniqueness (PU) for a match att1-att2:
  PUatt1,att2 = avg(AUatt1(T), AUatt2(T’))
• Recalculate EBD between att1(T)-att2(T’):
  EBDcorr(att1, att2) = EBDorig(att1, att2) x PUatt1,att2
[Figure: attribute clusters containing rName, name (Roads), name (Ports), lakename, name (Lakes), Name (Sea Ports), destPort, Dest.]

Att Match                         PUatt1-att2   EBDorig   EBDcorr
Name (Ports) – Name (Sea Ports)   .688          .90       .619
destPort – Dest                   .938          .80       .750
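The correction above is plain arithmetic and can be checked against the two rows of the table. A minimal sketch, with hypothetical function names; the AU formula itself is not reproduced here, so PU values are taken directly from the table.

```python
def pairwise_uniqueness(au_att1, au_att2):
    """PU for a match att1-att2: the average of the two attribute uniqueness values."""
    return (au_att1 + au_att2) / 2


def corrected_ebd(ebd_orig, pu):
    """EBDcorr = EBDorig x PU."""
    return ebd_orig * pu


# Rows from the table: (PU, EBDorig)
ports = corrected_ebd(0.90, 0.688)      # Name (Ports) – Name (Sea Ports)
dest_port = corrected_ebd(0.80, 0.938)  # destPort – Dest
print(round(ports, 3), round(dest_port, 3))
```

Since PU lies in [0,1], the correction can only lower an EBD score, and it lowers it most for matches between attributes with low uniqueness.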
Attribute Weighting Algorithm
High-Quality Clustering
• Due to the inherent randomness of clustering (e.g., choosing the initial centroids), EBD scores may not be stable [6]
• We need a way to produce consistent EBD values
  - To eliminate EBD variability
  - To provide a confidence value for our EBD value
  - To guarantee that our EBD value was generated from a high-quality clustering
• We proposed the following two cluster-based measures
  (1) Semantic Purity: the “meaning distance” between any two instances within the same cluster
  (2) Geographic Purity: the GT purity of a given cluster
Cluster Purity Measures
Distance-based Measure:
ImpS =
Geographic-Type Measure:
Objective Function to be Minimized:
OSSKM = where Wi =
[Figure: two clusterings of the same instances: {Collin, Tarrant, Plano} and {Kaufman, Coppell, Richardson}, versus {Collin, Tarrant, Kaufman} and {Plano, Coppell, Richardson}.]
1:N Matching
Many relationships are not 1:1, but involve matching groups of entities

Cmp:
id   Mailing Address
1    12 Plano Dr., Plano, TX, 75075
2    18 Coit Rd., Richardson, TX, 75080
3    200 Preston Rd., Dallas, TX
4    2 Hedgecoxe Rd.

N1 to N4:
Street Address          City     State   Zip
100 Genstar Dr.         Dallas   TX      75252
2091 Spring Creek Rd.   Plano    TX      75075
1704 Danube Ln.         Plano    TX      75075
18331 Roehampton Dr.    Dallas   TX      75252
Defining 1:N Matching
• 1:N matching can be defined in many ways
  - Optimize similarity or value of N?
  - Meronymy or Subsumption?
• We chose to optimize similarity (EBD)
  - Use EBD scores produced from 1-1 matches between Cmp and Nk (1 <= k <= N)
  - Apply greedy algorithm to add attributes to match with Cmp based on decreasing EBD score (highest to lowest)
  - Any 1:N match will minimize the set difference between GT(Cmp) and the union of the sets of GTs for the N matching attributes
  - We do not include an attribute in a 1:N match if it would make the EBD of the current match decrease
1:N Matching Example

Cmp:
id   Mailing Address
1    12 Plano Dr., Plano, TX, 75075
2    18 Coit Rd., Richardson, TX, 75080
4    2 Hedgecoxe Rd.

N1 to N4 (plus County):
Street Address          City     State   Zip     County
100 Genstar Dr.         Dallas   TX      75252   Dallas
2091 Spring Creek Rd.   Plano    TX      75075   Collin
1704 Danube Ln.         Plano    TX      75075   Collin

Order   Attribute        1-1 EBD w/ Cmp   1:N EBD
1       Street Address   .81              .81
2       City             .79              .88
3       State            .72              .92
4       Zip              .66              .95
Final 1:N EBD: .95
1:N Matching From Type Perspective
[Figure: instances of Mailing Address (types W, X, Y, Z) and of Street Address, City, State, Zip are grouped by type; entropy H(C) and conditional entropy H(C|T) are computed over the grouping.]
Greedy 1:N Matching Algorithm
program 1:N_Matching (S(T2), Sebd(T2)) {
  var E(T2) = Φ;
  var S(T2) = Φ;
  Sebd(T2) = 0.0;
  GTCmp = getGTSet(Cmp);
  E(T2) = getMatchCandidates(Cmp, T2, GTCmp);
  E(T2) = orderByEBD(E(T2));
  for att A ϵ E(T2), taken in order of decreasing EBD(Cmp, A) {
    if (increaseEBD(Cmp, Sebd(T2))) {
      Emax = A;
      S(T2) = S(T2) U Emax;
      Sebd(T2) = addEBD(Sebd(T2), EBD(Cmp, Emax));
    } end if
    E(T2) = E(T2) – A;
  } end for
}
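A runnable sketch of the greedy selection, under stated assumptions. The slides do not show how the EBD of Cmp against a group of attributes is computed, so the sketch takes it as a caller-supplied function `group_ebd`; that function, the name `greedy_one_to_n`, and the County scores in the example (.40 and .90) are hypothetical. The remaining scores are the ones from the 1:N Matching Example.

```python
def greedy_one_to_n(candidates, pair_ebd, group_ebd):
    """Greedily build a 1:N match for Cmp.

    candidates: attribute names from the other table
    pair_ebd:   dict mapping each attribute to its 1-1 EBD with Cmp
    group_ebd:  callable returning the EBD of Cmp against a list of attributes
    """
    selected, best = [], 0.0
    # Visit attributes from highest to lowest 1-1 EBD score.
    for attribute in sorted(candidates, key=pair_ebd.get, reverse=True):
        score = group_ebd(selected + [attribute])
        if score > best:  # keep the attribute only if the 1:N EBD increases
            selected.append(attribute)
            best = score
    return selected, best


pair_scores = {"Street Address": 0.81, "City": 0.79, "State": 0.72, "Zip": 0.66,
               "County": 0.40}  # County score is made up for illustration
group_scores = {
    frozenset(["Street Address"]): 0.81,
    frozenset(["Street Address", "City"]): 0.88,
    frozenset(["Street Address", "City", "State"]): 0.92,
    frozenset(["Street Address", "City", "State", "Zip"]): 0.95,
    frozenset(["Street Address", "City", "State", "Zip", "County"]): 0.90,  # made up
}
match, score = greedy_one_to_n(
    pair_scores, pair_scores, lambda attrs: group_scores[frozenset(attrs)]
)
```

With these scores the match grows through Street Address, City, State, and Zip, and County is left out because adding it would lower the 1:N EBD.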
Proof of Correctness
Theorem 1 (Greedy Choice Property for the 1:N matching algorithm): All choices for Emaxx(T2) will be present in an optimal 1:N match with Cmp ϵ T1.
Suppose that SebdN(T2), for an arbitrary SN(T2), produces an optimal EBD. Let us build a new set called S2ebdN(T2) from S2N(T2) such that every attribute included in S2N(T2) represented a value of Emaxx (T2) for some x. Also, the cardinality of SN(T2) and S2N(T2) are equal, and every attribute between SN(T2) and S2N(T2) is identical, except for an arbitrary attribute indexed by r (r <= N) in S2N(T2). Then by the definition of Emaxx for all x in Ex(T2), the EBD value produced between Cmp and attribute r in S2N(T2) is >= the EBD value produced between Cmp and attribute r given in SN(T2) . Since all other attributes are equal between SN(T2) and S2N(T2), then their associated 1:1 EBD scores with Cmp are also identical. Therefore, EBD(Cmp, S2N(T2)) >= EBD (Cmp, SN(T2)), but since SN(T2) produces an optimal EBD with Cmp through SebdN(T2), then EBD(Cmp, S2N(T2)) = EBD (Cmp, SN(T2)). Thus, S2N(T2) also produces an optimal EBD with Cmp through S2ebdN(T2).
Proof of Correctness (cont)
Theorem 2 (Optimal Substructure Property):
Let SebdN-1(T2), N > 1, be the EBD score corresponding to the attribute match between Cmp ϵ T1 and SN-1(T2) ϵ T2. If SebdN-1(T2) is an optimal EBD score, and SebdN(T2) is obtained by adding Emaxx to SN-1(T2) , then SebdN(T2) must also be an optimal EBD score.
Assume that SN(T2) was formed by adding Emaxx to SN-1(T2), but does not produce an optimal value of SebdN(T2). Emaxx represents the attribute with the highest EBD score with Cmp to be included in SN-1(T2) with respect to all other attributes in Ex(T2). Then this means that SN-1(T2) contains some attribute indexed by r (r <= N-1) whose EBD value is less than that of Emaxr. Thus, SebdN-1(T2) is not an optimal EBD score. This contradicts the statement above that SebdN-1(T2) is an optimal EBD score. Therefore, if SebdN-1(T2) is an optimal EBD score, and SebdN(T2) is obtained by adding Emaxx to SN-1(T2), then SebdN(T2) must be an optimal EBD score.
Theorem 3: Greedy 1:N matching produces a safe match with an optimal EBD score. This follows from Theorem 1 and Theorem 2.
Dataset Details
GIS Transportation Dataset (GTD)
GIS Location Dataset (GLD)
Dataset Details (cont)
GIS Point of Interest Dataset (GPD)
- Across all of our datasets, few shared instances exist
- Data is multijurisdictional in nature
- Number of attributes and instances differ
NGT Matching Over GTD
GT Matching Over GTD
The Effect of Latlong Values on Matching in GPD
The Effect of Attribute Weighting on Matching in GTD and GLD
Observing the Effects of Multiple Matching Methods over GPD
(1) GT matching
(2) GT matching + latlong
(3) GT matching + latlong + NGT matching
(4) GT matching + latlong + NGT matching + attribute weighting
GSim vs. N-grams, SVD, NMF & GSimG
1:N Matching Experiment Results
Experiment 1
T1 = {‘Address’}
T2 = {‘Street Address’, ‘City’, ‘State’, ‘Zip’}
1:N Matching Experiment Results (cont)
Experiment 2
T1 = {‘Island_Group’}
T2 = {‘Island1’, ‘Island2’, ‘Island3’, ‘Island4’, ‘Island5’, ‘Island6’}
‘Island6’ is not a part of ‘Island_Group’
Summary of Matching Methods
[Table: the matching methods N-grams, NGT Matching, GT Matching, GT + Latlong, GT + Cluster Purity, and Final GSim (Ideal), compared on Exact Match, Synonym Match, GT Match, GT + Latlong Match, and Hierarchical GT Match.]
Hierarchical GT Matching
• Use a GT hierarchy to match types with relationships between them (e.g., superclass/subclass, meronym/holonym, etc.)
GT hierarchy: Bodies of Water, Lakes, Rivers, Rapids, Streams
att1: Dell Lake, Dallas River, Coppell Stream
att2: HP Lake, Collin River, Plano Rapids
Get GT Relations From Ontology:
  Dell Lake, HP Lake
  Dallas River, Coppell Stream, Collin River, Plano Rapids
Calculate Similarity
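One way to picture the hierarchical grouping is to roll each GT up to its top-level type below the root before comparing instances. The sketch below is only an illustration: the parent links (Rapids and Streams placed under Rivers) are an assumption inferred from the grouping on the slide, and the names `PARENT`, `top_level_type`, and `group_by_top_level` are hypothetical.

```python
# Assumed hierarchy: child GT -> parent GT. The slide does not spell out these links.
PARENT = {
    "Lakes": "Bodies of Water",
    "Rivers": "Bodies of Water",
    "Rapids": "Rivers",
    "Streams": "Rivers",
}
ROOT = "Bodies of Water"


def top_level_type(gt, parent=PARENT, root=ROOT):
    """Climb the hierarchy until the next step up would be the root (or there is no parent)."""
    while gt in parent and parent[gt] != root:
        gt = parent[gt]
    return gt


def group_by_top_level(typed_instances):
    """Group (instance, GT) pairs by the top-level type of their GT."""
    groups = {}
    for instance, gt in typed_instances:
        groups.setdefault(top_level_type(gt), []).append(instance)
    return groups


typed = [
    ("Dell Lake", "Lakes"), ("Dallas River", "Rivers"), ("Coppell Stream", "Streams"),
    ("HP Lake", "Lakes"), ("Collin River", "Rivers"), ("Plano Rapids", "Rapids"),
]
groups = group_by_top_level(typed)
```

Under the assumed hierarchy, the two lakes form one group and the rivers, stream, and rapids form the other, which matches the grouping shown on the slide.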
THANK YOU!
ANY QUESTIONS?