STL: A similarity measure based on semantic and linguistic information
![Page 1: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/1.jpg)
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Enabling Networked Knowledge
STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information
Nitish Aggarwal
joint work with Tobias Wunner, Mihael Arcan
DERI, NUI Galway
Friday, 19th Aug, 2011
DERI, Friday Meeting
![Page 2: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/2.jpg)
Overview
Motivation & Applications
Why STL?
Semantic
Terminology
Linguistic
Evaluation
Conclusion and future work
![Page 3: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/3.jpg)
Motivation & Applications
Semantic Annotation
Similarity between corpus data and ontology concepts
SAP AG held €1615 million in
short-term liquid assets (2009)
“dbpedia:SAP_AG” “xEBR:LiquidAssets”
at “dbpedia:year:2009”
![Page 4: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/4.jpg)
Motivation & Applications
Semantic Search
Similarity between query and index object
SAP liquid asset in 2010
Net cash of SAP in 2010
Current asset of SAP last year
SAP total amount received in 2010
“dbpedia:SAP_AG” “xEBR:liquid asset” at “dbpedia:year:2010”
![Page 5: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/5.jpg)
Motivation & Applications
Ontology Matching & Alignment
Similarity between ontology concepts
xEBR concepts (xebr:KeyBalanceSheet): Assets, xebr:SubscribedCapitalUnpaid, xebr:FixedAssets, xebr:CurrentAssets, xebr:TangibleFixedAssets, xebr:IntangibleFixedAssets, xebr:AmountReceivable, xebr:LiquidAssets
IFRS concepts (ifrs:StatementOfFinancialPosition): ifrs:Assets, ifrs:CurrentAssets, ifrs:BiologicalAssets, ifrs:NonCurrentAssets, ifrs:Inventories, ifrs:TradeAndOtherCurrentReceivables, ifrs:PropertyPlantAndEquipment, ifrs:CashAndCashEquivalents
Similarity = ?
![Page 6: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/6.jpg)
Classical Approaches
String Similarity
Levenshtein distance, Dice Coefficient
Corpus-based
LSA, ESA, Google distance, Vector-Space Model
Ontology-based
Path distance, Information content
Syntax Similarity
Word-order, Part of Speech
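A minimal Python sketch of the two string measures named on this slide, for illustration only. The function names are made up here, and computing the Dice coefficient over word tokens (rather than character n-grams) is an assumption, since the slide does not say which variant was used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient over lower-cased word-token sets (assumed variant)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty labels are treated as identical
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


t1, t2 = "tangible fixed assets", "intangible fixed assets"
print(levenshtein(t1, t2))                 # 2
print(round(dice_coefficient(t1, t2), 2))  # 0.67
```

Both scores rate these two labels as close although they name different concepts, which is the kind of gap that motivates looking beyond string similarity.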
![Page 7: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/7.jpg)
Why STL?
Semantic
Semantic structure and relations
Terminology
complex terms expressing the same concept
Linguistic
Phrase and dependency structure
![Page 8: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/8.jpg)
STL
Definition
Linear combination of semantic, terminological and linguistic similarity scores,
with weights obtained by linear regression
Formula used
STL = w1*S + w2*T + w3*L + Constant
– w1, w2, w3 represent the contribution of each component (S, T, L)
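The regression step can be sketched in plain Python as an ordinary least-squares fit of w1, w2, w3 and the constant. This is only an illustration under stated assumptions: the function names are hypothetical, the training rows are made-up numbers rather than data from the talk, and the slides do not say which regression tool was actually used.

```python
def fit_stl_weights(rows, targets):
    """Ordinary least squares for target ~ w1*S + w2*T + w3*L + constant.

    rows: list of (S, T, L) tuples; targets: list of gold similarity scores.
    Returns [w1, w2, w3, constant].
    """
    # Design matrix with a trailing 1.0 for the constant term.
    X = [list(r) + [1.0] for r in rows]
    n = len(X[0])
    # Normal equations (X^T X) w = X^T y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system: need more varied training pairs")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def stl(s, t, l, weights):
    """Apply STL = w1*S + w2*T + w3*L + Constant."""
    w1, w2, w3, constant = weights
    return w1 * s + w2 * t + w3 * l + constant


# Made-up (S, T, L) scores; targets are generated from known weights
# so that the fit can be checked against them.
rows = [(0.1, 0.9, 0.3), (0.8, 0.2, 0.5), (0.4, 0.4, 0.9),
        (0.7, 0.6, 0.1), (0.3, 0.1, 0.2)]
true_weights = [0.2, 0.5, 0.1, 0.15]
targets = [stl(s, t, l, true_weights) for s, t, l in rows]
weights = fit_stl_weights(rows, targets)
```

In practice the targets would be human similarity judgements for term pairs, and the fitted weights would then indicate how much each component contributes.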
![Page 9: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/9.jpg)
Semantic
Wu-Palmer
2*depth(MSCA) / (depth(c1) + depth(c2))
Resnik’s Information Content
IC(c) = -log p(c)
Intrinsic Information Content (Pirro09)
Avoids the need to analyse large corpora
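The Wu-Palmer formula on this slide can be tried out on a toy hierarchy. The sketch below is illustrative only: the child-to-parent table is an assumed fragment loosely modelled on concept labels from the talk, not the real xEBR taxonomy, the root is given depth 1 by convention, and all function names are hypothetical.

```python
# Toy hierarchy (child -> parent); an illustrative assumption.
PARENT = {
    "Fixed Assets": "Assets",
    "Current Assets": "Assets",
    "Tangible Fixed Assets": "Fixed Assets",
    "Amount Receivable": "Current Assets",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def msca(c1, c2):
    """Deepest concept that subsumes both c1 and c2 (the MSCA)."""
    ancestors_c2 = set(path_to_root(c2))
    for node in path_to_root(c1):  # walks upwards, so the first hit is the deepest
        if node in ancestors_c2:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(c1, c2):
    """2*depth(MSCA) / (depth(c1) + depth(c2))."""
    return 2 * depth(msca(c1, c2)) / (depth(c1) + depth(c2))


print(msca("Tangible Fixed Assets", "Amount Receivable"))       # Assets
print(wu_palmer("Fixed Assets", "Tangible Fixed Assets"))       # 0.8
```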
![Page 10: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/10.jpg)
Cont.
Intrinsic information content (iIC)
[formula not preserved in the transcript]
where sub(c) is the number of sub-concepts of a given concept c.
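The slide's own iIC formula did not survive in this transcript. As a hedged illustration, one widely used definition of intrinsic information content (Seco et al., 2004) is iIC(c) = 1 - log(sub(c) + 1) / log(N), where N is the total number of concepts in the taxonomy. Whether the talk used exactly this form is an assumption, and the function and parameter names below are hypothetical.

```python
import math


def intrinsic_ic(num_subconcepts: int, total_concepts: int) -> float:
    """Intrinsic IC in the assumed Seco-style form: 1 - log(sub(c) + 1) / log(N).

    A leaf concept (no sub-concepts) gets the maximum value 1.0; a root that
    subsumes every other concept gets 0.0.
    """
    if total_concepts < 2:
        raise ValueError("need at least two concepts in the taxonomy")
    if not 0 <= num_subconcepts < total_concepts:
        raise ValueError("sub(c) must be between 0 and N - 1")
    return 1.0 - math.log(num_subconcepts + 1) / math.log(total_concepts)


N = 100  # hypothetical taxonomy size
leaf_ic = intrinsic_ic(0, N)      # most specific concept
root_ic = intrinsic_ic(N - 1, N)  # most general concept
```

The measure needs only the taxonomy itself, which is what lets it sidestep corpus statistics.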
Pirro_Similarity
![Page 11: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/11.jpg)
Cont.
Concept hierarchy (node labels from the diagram):
Assets
Subscribed Capital Unpaid
Fixed Assets
Current Assets
Tangible Fixed Assets
Amount Receivable
Stocks
Amount Receivable [Total]
Amount Receivable within one year
Amount Receivable after more than one year
Trade Debtors
Other Debtors
Other Tangible Fixed Assets
Property, Plant and Equipment
Property, Plant and Equipment [Total]
Other Property, Plant and Equipment
Land and Building
Plant and Machinery
Other Fixture
Furniture Fixture and Equipment
Payments on account and asset in construction
Pirro_Sim = ?
MSCA: subconcepts = 48, IC (MSCA) = 0.32
Tangible Fixed Assets (TFA): subconcepts = 9, IC (TFA) = 0.60
Amount Receivable (AR): subconcepts = 6, IC (AR) = 0.69
Pirro_Sim = 0.33
![Page 12: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/12.jpg)
Limitation
Does semantic structure reflect a good similarity?
Not necessarily
– e.g. in xEBR, the parent-child relation is used to describe the layout of concepts
– “Work in progress” is not a type of asset, although both are linked via the parent-child relationship
![Page 13: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/13.jpg)
Terminology
Definition
Common naming convention
N-grams vs. subterms
In the financial domain, the bigram “Intangible Fixed” is a substring of “Other Intangible Fixed Assets” but not a subterm.
Terminological similarity
maximal subterm overlap
![Page 14: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/14.jpg)
Cont.
Trade Debts Payable After More Than One Year
[[Trade][Debts]][Payable][After More Than One Year]
Financial Debts Payable After More Than One Year
Financial[Debts][Payable][After More Than One Year]
Subterm sources: [Investopedia:Trade], [Investoword:Debt], [FinanceDict:Trade Debts], [SAP:Payable], [Ifrs:After More Than One Year]
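The decomposition above can be sketched as follows: a subterm is a contiguous word sequence that also appears as a term in some term base. Everything here is an assumption made for illustration: the term base is a tiny hand-built set based on the bracketed entries of this example (using the plural "debts" to avoid a lemmatisation step), the function names are hypothetical, and scoring the overlap with a Dice-style ratio is one possible reading of "maximal subterm overlap", not necessarily the one used in the talk.

```python
# Hypothetical term base built from the bracketed subterms in the example.
TERM_BASE = {
    "trade", "debts", "trade debts", "payable", "after more than one year",
}


def subterms(term, term_base=TERM_BASE):
    """All contiguous word sequences of `term` that are known terms."""
    words = term.lower().split()
    found = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            candidate = " ".join(words[i:j])
            if candidate in term_base:
                found.add(candidate)
    return found


def subterm_overlap(term1, term2, term_base=TERM_BASE):
    """Dice-style overlap of the two subterm sets (assumed scoring choice)."""
    a, b = subterms(term1, term_base), subterms(term2, term_base)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


t1 = "Trade Debts Payable After More Than One Year"
t2 = "Financial Debts Payable After More Than One Year"
print(sorted(subterms(t2)))     # ['after more than one year', 'debts', 'payable']
print(subterm_overlap(t1, t2))  # 0.75
```

Unlike a plain n-gram comparison, a word sequence that is not itself a term, such as "more than", contributes nothing to the score.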
![Page 15: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/15.jpg)
Multilingual Subterms
Translated subterms
Available in other languages
Advantage
Reflect terminological similarities that may be available in one
language but not in others.
“Property Plant and Equipment”@en
“Tangible Fixed Asset”@en
“Sachanlagen”@de
![Page 16: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/16.jpg)
Linguistic
Syntactic Information
Beyond simple word order
– Phrase structure
– Dependency structure
Phrase structure
Intangible fixed : adj adj > ??
Intangible fixed assets : adj adj n > NP
Dependency structure
Amounts receivable : N Adv : receive:mod, amounts:head
Received amounts : V N : receive:mod, amounts:head
![Page 17: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/17.jpg)
Evaluation
Data Set
xEBR finance vocabulary
269 terms (concept labels)
72,361 (269 × 269) term pairs
Benchmarks
SimSem59: sample of 59 term pairs
SimSem200 : sample of 200 term pairs (under construction)
![Page 18: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/18.jpg)
Experiment
An overview of similarity measures
![Page 19: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/19.jpg)
Experiment Results (SimSem59)
STL formula used
STL = 0.1531 * S + 0.5218 * T + 0.1041 * L + 0.1791
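A short Python sketch applying the fitted formula above. The weights and constant are copied from the slide; the function names are hypothetical, and reading each weight's share of the weight total as its "contribution" assumes that S, T and L are on comparable scales (for example all in [0, 1]), which the slides do not state.

```python
# Weights and constant as reported on the slide for SimSem59.
W_S, W_T, W_L, CONSTANT = 0.1531, 0.5218, 0.1041, 0.1791


def stl_simsem59(s: float, t: float, l: float) -> float:
    """STL = 0.1531*S + 0.5218*T + 0.1041*L + 0.1791."""
    return W_S * s + W_T * t + W_L * l + CONSTANT


def relative_contributions() -> dict:
    """Each weight as a fraction of the sum of the three weights."""
    total = W_S + W_T + W_L
    return {"S": W_S / total, "T": W_T / total, "L": W_L / total}


shares = relative_contributions()
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under that assumption the terminological component carries the largest share of the combined weight.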
Correlation between similarity scores & SimSem59
[Callouts on the slide: Semantic Contribution, Terminology Contribution, Linguistic Contribution]
![Page 20: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/20.jpg)
Conclusion
STL outperforms more traditional similarity measures
Largest contribution by T (Terminological Analysis)
Multilingual subterms perform better than monolingual ones
![Page 21: STL: A similarity measure based on semantic and linguistic information](https://reader035.fdocuments.net/reader035/viewer/2022081401/55965af61a28abeb5f8b47cc/html5/thumbnails/21.jpg)
Future work
Evaluation on larger data set and vocabularies (IFRS)
3000+ terms
9M term pairs
Richer set of linguistic operations
“recognise” => “recognition”
by derivation rule verb_lemma + “ion”
Similarity between subterms
“Staff Costs” and “Wages And Salaries”