Measuring Semantic Similarity and Relatedness in the Biomedical Domain: Methods and Applications
Ted Pedersen, Ph.D.
Department of Computer Science
University of Minnesota, Duluth
[email protected]
http://www.d.umn.edu/~tpederse
February 21, 2012
Acknowledgments
This work on semantic similarity and relatedness has been supported by a National Science Foundation CAREER award (2001-2007, #0092784, PI Pedersen) and by the National Library of Medicine, National Institutes of Health (2008-present, 1R01LM009623-01A2, PI Pakhomov).
The contents of this talk are solely my responsibility and do not necessarily represent the official views of the National Science Foundation or the National Institutes of Health.
Topics
Semantic similarity vs. semantic relatedness
How to measure similarity
With ontologies and corpora
How to measure relatedness
With definitions and corpora
Applications?
Word Sense Disambiguation
Sentiment Classification
What are we measuring?
Concept pairs
Assign a numeric value that quantifies how similar or related two concepts are
Not words
Must know concept underlying a word form
Cold may be temperature or illness
Concept Mapping
Word Sense Disambiguation
But, WSD can be done using semantic similarity!
SenseRelate, later in the talk
Why?
Being able to organize concepts by their similarity or relatedness to each other is a fundamental operation in the human mind, and in many problems in Natural Language Processing and Artificial Intelligence
If we know a lot about X, and if we know Y is similar to X, then a lot of what we know about X may apply to Y
Use X to explain or categorize Y
GOOD NEWS!
Free Open Source Software!
WordNet::Similarity
http://wn-similarity.sourceforge.net
General English
Widely used (550+ citations)
UMLS::Similarity
http://umls-similarity.sourceforge.net
Unified Medical Language System
Spun off from WordNet::Similarity
But has added a whole lot!
Similar or Related?
Similarity based on is-a relations
How much is X like Y?
Share ancestor in is-a hierarchy
LCS : least common subsumer
The closer / deeper the ancestor, the more similar
Tetanus and strep throat are similar
both are kinds of bacterial infections
Least Common Subsumer (LCS)
Similar or Related?
Relatedness more general
How much is X related to Y?
Many ways to be related
is-a, part-of, treats, affects, symptom-of, ...
Tetanus and deep cuts are related but they really aren't similar
(deep cuts can cause tetanus)
All similar concepts are related, but not all related concepts are similar
Measures of Similarity
(WordNet::Similarity & UMLS::Similarity )
Path Based
Rada et al., 1989 (path)
Caviedes & Cimino, 2004 (cdist)*
*cdist only in UMLS::Similarity
Path + Depth
Wu & Palmer, 1994 (wup)
Leacock & Chodorow, 1998 (lch)
Zhong et al., 2002 (zhong)*
Nguyen & Al-Mubaid, 2006 (nam)*
*zhong and nam only in UMLS::Similarity
Measures of Similarity
(WordNet::Similarity & UMLS::Similarity)
Path + Information Content
Resnik, 1995 (res)
Jiang & Conrath, 1997 (jcn)
Lin, 1998 (lin)
Path Based Measures
Distance between concepts (nodes) in tree intuitively appealing
Spatial orientation, good for networks or maps but not is-a hierarchies
Reasonable approximation sometimes
Assumes all paths have same weight
But, more specific (deeper) paths tend to travel less semantic distance
Shortest path a good start, but needs corrections
Shortest is-a Path
$$\mathrm{path}(a,b) = \frac{1}{\text{shortest is-a path}(a,b)}$$
We count nodes...
Maximum = 1 (self similarity)
path(tetanus, tetanus) = 1
Minimum = 1 / (longest path in is-a tree)
path(typhoid, oral thrush) = 1/7
path(food poisoning, strep throat) = 1/7
etc...
path(strep throat, tetanus) = .25
path (bacterial infections, yeast infections) = .25
?
Are bacterial infections and yeast infections similar to the same degree as are tetanus and strep throat ?
The path measure says yes, they are.
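A minimal Python sketch of the path measure on a toy is-a tree. The tree itself is an assumption: the original figure is not reproduced in this text, so the node names "infections", "fungal infections" and "streptococcal infections" are hypothetical, chosen only so that the node counts agree with the values quoted above.

```python
# Toy is-a tree stored as child -> parent.
# Names marked (hypothetical) are assumptions, not taken from the slides.
PARENT = {
    "bacterial infections": "infections",                # root name (hypothetical)
    "fungal infections": "infections",                   # (hypothetical)
    "tetanus": "bacterial infections",
    "streptococcal infections": "bacterial infections",  # (hypothetical)
    "strep throat": "streptococcal infections",
    "yeast infections": "fungal infections",
    "oral thrush": "yeast infections",
}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a, b):
    """1 / (number of nodes on the shortest is-a path between a and b)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    index_in_b = {node: i for i, node in enumerate(chain_b)}
    for steps_a, node in enumerate(chain_a):
        if node in index_in_b:  # first shared ancestor is the LCS
            nodes_on_path = steps_a + index_in_b[node] + 1
            return 1.0 / nodes_on_path
    raise ValueError("concepts do not share an ancestor")


print(path_similarity("strep throat", "tetanus"))
print(path_similarity("bacterial infections", "yeast infections"))
```

Both calls count four nodes on the path, so the measure cannot tell the specific pair from the general pair.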
Path + Depth
Path only doesn't account for specificity
Deeper concepts more specific
Paths between deeper concepts travel less semantic distance
Wu and Palmer, 1994
$$\mathrm{wup}(a,b) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(a,b))}{\mathrm{depth}(a) + \mathrm{depth}(b)}$$
depth(x) = shortest is-a path(root,x)
wup(strep throat, tetanus) = (2*2)/(4+3) = .57
wup (bacterial infections, yeast infections) = (2*1)/(2+3) = .4
?
Wu and Palmer say that strep throat and tetanus (.57) are more similar than are bacterial infections and yeast infections (.4)
Path says that strep throat and tetanus (.25) are exactly as similar as bacterial infections and yeast infections (.25)
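A minimal Python sketch of the Wu and Palmer measure, with depth counted in nodes from the root. The toy tree is an assumption reconstructed to match the depths used above; the node names "infections", "fungal infections" and "streptococcal infections" are hypothetical.

```python
# Toy is-a tree stored as child -> parent (hypothetical reconstruction).
PARENT = {
    "bacterial infections": "infections",                # root name (hypothetical)
    "fungal infections": "infections",                   # (hypothetical)
    "tetanus": "bacterial infections",
    "streptococcal infections": "bacterial infections",  # (hypothetical)
    "strep throat": "streptococcal infections",
    "yeast infections": "fungal infections",
}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(concept):
    """Number of nodes on the is-a path from the root down to concept."""
    return len(ancestors(concept))


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    in_b = set(ancestors(b))
    for node in ancestors(a):  # walks upward, so the first hit is the deepest
        if node in in_b:
            return node
    raise ValueError("concepts do not share an ancestor")


def wup(a, b):
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


print(round(wup("strep throat", "tetanus"), 2))
print(round(wup("bacterial infections", "yeast infections"), 2))
```

Unlike the path measure, the score now depends on how deep the shared ancestor sits, so the more specific pair scores higher.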
Information Content
ic(concept) = -log p(concept) [Resnik 1995]
Need to count concepts
Term frequency + Inherited frequency
p(concept) = (tf + if) / N
Depth shows specificity but not frequency
Low frequency concepts often much more specific than high frequency ones
Information Content
term frequency (tf)
Information Content
inherited frequency (if)
Information Content (IC = -log (f/N))
final count (f = tf + if, N = 365,820)
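A minimal Python sketch of how term frequency and inherited frequency combine into information content. Everything numeric here is an assumption: the counts are made up, N is taken to be the sum of all term frequencies, the natural logarithm is used, and the tree node names "infections", "fungal infections" and "streptococcal infections" are hypothetical.

```python
import math

# Toy is-a tree stored as child -> parent (hypothetical reconstruction).
PARENT = {
    "bacterial infections": "infections",
    "fungal infections": "infections",
    "tetanus": "bacterial infections",
    "streptococcal infections": "bacterial infections",
    "strep throat": "streptococcal infections",
    "yeast infections": "fungal infections",
    "oral thrush": "yeast infections",
}

# Made-up term frequencies (tf): how often each concept itself was observed.
TERM_FREQ = {
    "infections": 40, "bacterial infections": 20, "fungal infections": 10,
    "tetanus": 10, "streptococcal infections": 5, "strep throat": 5,
    "yeast infections": 8, "oral thrush": 2,
}


def information_content(term_freq, parent):
    """Return {concept: -log(f / N)} where f = tf + inherited frequency."""
    final_count = {}
    for concept, tf in term_freq.items():
        node = concept
        while node is not None:  # credit the concept and every ancestor
            final_count[node] = final_count.get(node, 0) + tf
            node = parent.get(node)
    n = sum(term_freq.values())  # assumption: N is the total of all tf counts
    return {c: -math.log(f / n) for c, f in final_count.items() if f > 0}


IC = information_content(TERM_FREQ, PARENT)
for concept in ("infections", "bacterial infections", "tetanus", "strep throat"):
    print(concept, round(IC[concept], 3))
```

Under these assumptions the root has an information content of zero, and rarer, more specific concepts receive higher values.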
Lin, 1998
$$\mathrm{lin}(a,b) = \frac{2 \cdot \mathrm{IC}(\mathrm{LCS}(a,b))}{\mathrm{IC}(a) + \mathrm{IC}(b)}$$
Look familiar?
$$\mathrm{wup}(a,b) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(a,b))}{\mathrm{depth}(a) + \mathrm{depth}(b)}$$
lin (strep throat, tetanus) =
2 * 2.26 / (5.21 + 4.11) = 0.485
lin (bacterial infection, yeast infection) =
2 * 0.71 / (2.26+2.81) = 0.280
?
Lin says that strep throat and tetanus (.49) are more similar than are bacterial infection and yeast infection (.28)
Wu and Palmer say that strep throat and tetanus (.57) are more similar than are bacterial infection and yeast infection (.4)
Path says that strep throat and tetanus (.25) are exactly as similar as bacterial infection and yeast infection (.25)
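The Lin arithmetic above can be checked with a few lines of Python. The function name and argument order are illustrative choices; the information content values are the ones quoted in the talk.

```python
def lin(ic_a, ic_b, ic_lcs):
    """Lin similarity from the information content of a, b and their LCS."""
    return 2 * ic_lcs / (ic_a + ic_b)


# IC values as quoted above: the two concepts, then their least common subsumer.
strep_vs_tetanus = lin(5.21, 4.11, 2.26)
bacterial_vs_yeast = lin(2.26, 2.81, 0.71)

print(round(strep_vs_tetanus, 3))    # 0.485
print(round(bacterial_vs_yeast, 3))  # 0.28
```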
How to decide?
Consider quality of information
If you have a consistent and reliable ontology, maybe depth based works best
If you have corpora with reliable sense tags or concept mapping, maybe information content
Evaluation
Intrinsic (relative to human performance)
Data at : http://rxinformatics.umn.edu
Extrinsic (task based)
What about concepts not connected via is-a relations?
Connected via other relations?
Part-of, treatment-of, causes, etc.
Not connected at all?
In different sections (axes) of an ontology (infections and treatments)
In different ontologies entirely (SNOMEDCT and FMA)
Relatedness!
Use definition information
No is-a relations so can't be similarity
Measures of relatedness
WordNet::Similarity & UMLS::Similarity
Definition based
Lesk, 1986
Adapted lesk (lesk)
Banerjee & Pedersen, 2003
Definition + corpus
Gloss Vector (vector)
Patwardhan & Pedersen, 2006
Measuring relatedness with definitions
Related concepts defined using many of the same terms
But, definitions are short, inconsistent
Concepts don't need to be connected via relations or paths to measure them
Lesk, 1986
Adapted Lesk, Banerjee & Pedersen, 2003
Two separate ontologies...
Could join them together ?
Each concept has definition
Find overlaps in definitions...
Overlaps
Oral Thrush and Alopecia
side effect of chemotherapy
Can't see this in structure of is-a hierarchies
Oral thrush and folliculitis just as similar
Alopecia and Folliculitis
hair disorder & hair
Reflects structure of is-a hierarchies
If you start with text like this maybe you can build is-a hierarchies automatically!
Another tutorial...
Lesk and Adapted Lesk
Lesk, 1986 : measure overlaps in definitions to assign senses to words
The more overlaps between two senses (concepts), the more related
Banerjee & Pedersen, 2003, Adapted Lesk
Augment definition of each concept with definitions of related concepts
Build a super gloss
Increase chance of finding overlaps
lesk in WordNet::Similarity & UMLS::Similarity
The problem with definitions ...
Definitions contain variations of terminology that make it impossible to find exact overlaps
Alopecia : a result of cancer treatment
Thrush : a side effect of chemotherapy
Real life example, I modified the alopecia definition to work better with Lesk!!!
NO MATCHES!!
How can we see that result and side effect are similar, as are cancer treatment and chemotherapy ?
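A minimal Python sketch of definition overlap, to make the failure concrete. This is a simplification, not the scoring used by the lesk measure in the software: it counts shared content words only, whereas Adapted Lesk also rewards longer phrasal overlaps. The stopword list and the second alopecia definition are assumptions made up for illustration.

```python
# Small, hypothetical stopword list.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "to", "is", "that"}


def content_words(definition):
    """Lowercase, strip commas, and drop stopwords."""
    words = definition.lower().replace(",", " ").split()
    return {w for w in words if w not in STOPWORDS}


def overlap_score(def_a, def_b):
    """Number of content words shared by the two definitions."""
    return len(content_words(def_a) & content_words(def_b))


alopecia = "a result of cancer treatment"
thrush = "a side effect of chemotherapy"
# Hypothetical rewording of the alopecia definition that exact matching can use.
alopecia_reworded = "hair loss that is a side effect of chemotherapy"

print(overlap_score(alopecia, thrush))
print(overlap_score(alopecia_reworded, thrush))
```

The first pair shares no content words even though the definitions say nearly the same thing; only the reworded definition produces overlaps.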
Gloss Vector Measure
of Semantic Relatedness
Rely on co-occurrences of terms
Terms that occur within some given number of terms of each other
Allows for a fuzzier notion of matching
Exploits second order co-occurrences
Friend of a friend relation
Suppose cancer_treatment and chemotherapy don't occur in text with each other. But, suppose that survival occurs with each.
cancer_treatment and chemotherapy are second order co-occurrences via survival
Gloss Vector Measure
of Semantic Relatedness
Replace words or terms in definitions with vector of co-occurrences observed in corpus
Defined concept now represented by an averaged vector of co-occurrences
Measure relatedness of concepts via cosine between their respective vectors
Patwardhan and Pedersen, 2006
Inspired by Schütze, 1998
vector in WordNet::Similarity & UMLS::Similarity
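A minimal Python sketch of the gloss vector idea: replace each term of a definition with its co-occurrence vector, average, and compare with cosine. The co-occurrence counts are made up, and the definitions are assumed to be already split into terms (for example, cancer_treatment treated as one term). This is an illustration of the technique, not the implementation in either software package.

```python
import math
from collections import Counter

# Hypothetical first-order co-occurrence counts observed in some corpus.
COOCCURRENCES = {
    "result": Counter({"outcome": 2, "patient": 1}),
    "side_effect": Counter({"outcome": 1, "patient": 2}),
    "cancer_treatment": Counter({"survival": 3, "tumor": 1}),
    "chemotherapy": Counter({"survival": 2, "dose": 1}),
}


def gloss_vector(terms):
    """Average the co-occurrence vectors of the terms in a definition."""
    if not terms:
        return {}
    total = Counter()
    for term in terms:
        total.update(COOCCURRENCES.get(term, {}))
    return {dim: count / len(terms) for dim, count in total.items()}


def cosine(u, v):
    dot = sum(value * v.get(dim, 0.0) for dim, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


alopecia_terms = ["result", "cancer_treatment"]
thrush_terms = ["side_effect", "chemotherapy"]

shared_terms = set(alopecia_terms) & set(thrush_terms)
relatedness = cosine(gloss_vector(alopecia_terms), gloss_vector(thrush_terms))
print(len(shared_terms), round(relatedness, 2))
```

The two definitions share no terms, yet their vectors are close because cancer_treatment and chemotherapy both co-occur with survival in the toy counts.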
Experimental Results
Vector > Lesk > Info Content > Depth > Path
Clear trend across various studies
Dramatic differences when comparing to human reference standards (Vector > Lesk >> Info Content > Depth > Path)
Banerjee and Pedersen, 2003 (IJCAI)
Pedersen, et al. 2007 (JBI)
Differences less extreme in extrinsic task-based evaluations
Human raters mix up similarity & relatedness?
So far we've shown that ...
we can quantify the similarity and relatedness between concepts using a variety of sources of information
Paths
Depths
Information content
Definitions
Co-occurrence / corpus data
There is open source software to help you!
Sounds great! What now?
SenseRelate Hypothesis : Most words in text will have multiple possible senses and will often be used with the sense most related to those of surrounding words
He either has a cold or the flu
Cold not likely to mean temperature
The underlying sentiment of a text can be discovered by determining which emotion is most related to the words in that text
I cried a lot after my mother died.
Happy?
SenseRelate!
In coherent text words will be used in similar or related senses, and these will also be related to the overall topic or mood of a text
First applied to WSD in 2002
Banerjee and Pedersen, 2002 (WordNet)
Patwardhan et al., 2003 (WordNet)
Pedersen and Kolhatkar 2009 (WordNet)
McInnes et al., 2011 (UMLS)
Recently applied to emotion classification
Pedersen, 2012 (i2b2 suicide notes challenge)
(MORE) GOOD NEWS!
Free Open Source Software!
WordNet::SenseRelate
AllWords, TargetWord, WordToSet
http://senserelate.sourceforge.net
UMLS::SenseRelate
AllWords, TargetWord
http://search.cpan.org/dist/UMLS-SenseRelate/
SenseRelate for WSD
Assign each word the sense which is most similar or related to one or more of its neighbors
Pairwise
2 or more neighbors
Pairwise algorithm results in a trellis much like in HMMs
More neighbors adds lots of information and a lot of computational complexity
SenseRelate - pairwise
SenseRelate 2 neighbors
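A minimal Python sketch of the SenseRelate idea for a single target word: score each candidate sense by its relatedness to the best-matching sense of every neighbor, then keep the highest-scoring sense. The sense labels and relatedness scores are made up; in practice the scores would come from a measure such as lesk or vector. This is a simplified illustration, not the algorithm as implemented in the SenseRelate packages.

```python
# Hypothetical sense inventory for the example sentence
# "He either has a cold or the flu".
SENSES = {
    "cold": ["cold#temperature", "cold#illness"],
    "flu": ["flu#illness"],
}

# Made-up pairwise relatedness scores between senses.
RELATEDNESS = {
    frozenset(["cold#illness", "flu#illness"]): 0.9,
    frozenset(["cold#temperature", "flu#illness"]): 0.1,
}


def relatedness(sense_a, sense_b):
    return RELATEDNESS.get(frozenset([sense_a, sense_b]), 0.0)


def sense_score(sense, neighbors):
    """Sum, over neighbors, of the best relatedness to any neighbor sense."""
    return sum(
        max(relatedness(sense, other) for other in SENSES[neighbor])
        for neighbor in neighbors
    )


def disambiguate(target, neighbors):
    return max(SENSES[target], key=lambda sense: sense_score(sense, neighbors))


print(disambiguate("cold", ["flu"]))
```

With more neighbors the same loop applies, but the number of sense pairs to compare grows quickly, which is the computational cost noted above.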
SenseRelate for
Sentiment Classification
Find emotion most related to context
Similarity less effective since many words can be related to an emotion, but fewer are similar
Related to happy? : love, food, success, ...
Similar to happy? : joyful, ecstatic, pleased, ...
Pairwise comparisons between emotion and senses of words in context
Same form as Naive Bayesian model or Latent Variable model
WordNet::SenseRelate::WordToSet
SenseRelate - WordToSet
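A minimal Python sketch of choosing the emotion most related to a context, in the spirit of WordToSet. The sense labels and relatedness scores are made up for illustration, and the scoring (sum over context words of the best-matching sense) is an assumption rather than the exact procedure used in the software.

```python
# Hypothetical senses for the content words of
# "I cried a lot after my mother died."
SENSES = {
    "cried": ["cry#weep", "cry#shout"],
    "mother": ["mother#parent"],
    "died": ["die#decease"],
}

# Made-up relatedness scores between an emotion and a word sense.
RELATEDNESS = {
    ("sad", "cry#weep"): 0.8, ("sad", "cry#shout"): 0.3,
    ("sad", "mother#parent"): 0.2, ("sad", "die#decease"): 0.7,
    ("happy", "cry#weep"): 0.2, ("happy", "cry#shout"): 0.3,
    ("happy", "mother#parent"): 0.4, ("happy", "die#decease"): 0.05,
}


def emotion_score(emotion, words):
    """Sum, over context words, of the best relatedness to any word sense."""
    return sum(
        max(RELATEDNESS.get((emotion, sense), 0.0) for sense in SENSES[word])
        for word in words
    )


def most_related_emotion(emotions, words):
    return max(emotions, key=lambda emotion: emotion_score(emotion, words))


context = ["cried", "mother", "died"]
print(most_related_emotion(["happy", "sad"], context))
```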
Experimental Results
WSD results vary with part of speech
Nouns reasonably accurate; verbs, adjectives and adverbs less so
Jiang-Conrath measure often a high performer for nouns (e.g., Patwardhan et al. 2003)
Vector and lesk do well because of coverage (handle mixed pairs while others don't)
Sentiment classification results in 2011 i2b2 suicide notes challenge were disappointing (Pedersen, 2012)
Suicide notes not very emotional!
Future Work
Integrate Unsupervised Clustering with WordNet::Similarity and UMLS::Similarity
http://senseclusters.sourceforge.net
Exploit graphical nature of SenseRelate
e.g., Minimal Spanning Trees / Viterbi Algorithm to solve larger problem spaces?
Incorporate SenseRelate into Biomedicus
Ongoing, Pakhomov and Bill (UMTC)
UIMA/Java based Clinical NLP pipeline
Attract and support users for all of these tools!
Conclusion
Measures of semantic similarity and relatedness are supported by a rich body of theory, and good open source software
http://wn-similarity.sourceforge.net
http://umls-similarity.sourceforge.net
These measures can be viewed as building blocks for many NLP and AI applications
Word sense disambiguation
http://senserelate.sourceforge.net
Sentiment classification
UMLS::Similarity Collaborators
Serguei Pakhomov : Mayo, UMTC
PI of NIH similarity and relatedness grant that enabled development of UMLS::Similarity
Bridget McInnes
BS UMD, 2002; MS UMD, 2004
PhD UMTC, 2009
Post-doc UMTC, 2009 - 2011
Now at Securboration
Ying Liu
Post-doc UMTC 2009 - present
WordNet::Similarity Collaborators
Varada Kolhatkar : MS UMD, 2009
Now PhD student at U of Toronto
Jason Michelizzi : MS UMD, 2005
Now US Navy pilot
Siddharth Patwardhan : MS UMD, 2003
PhD Utah, 2009
Now at IBM Watson Research
Satanjeev Banerjee : MS UMD, 2002
PhD CMU, 2010
Now at Twitter Research
Development Milestones
Satanjeev Banerjee, MS thesis 2002
Lesk algorithm for WSD, realized that lesk measure and WSD could be separated!
Original developer of WordNet::Similarity::lesk
Siddharth Patwardhan, MS thesis 2003
Original developer of WordNet::Similarity
Original developer of WordNet::SenseRelate::TargetWord
Original developer of SNOMED::Similarity
At Mayo, Summer 2003 with S. Pakhomov
Development Milestones
Jason Michelizzi, MS thesis 2005
Rewrote WordNet::Similarity
Original developer of WordNet::SenseRelate::AllWords and WordNet::SenseRelate::WordToSet
Varada Kolhatkar, MS thesis 2009
Rewrote WordNet::SenseRelate::AllWords
Development Milestones
Bridget McInnes, post-doc UMTC 2009-2011
Original developer of UMLS::Similarity
Spun off from WordNet::Similarity and SNOMED::Similarity
Original developer of UMLS::SenseRelate
Spun off from WordNet::SenseRelate::AllWords and WordNet::SenseRelate::TargetWord
Ying Liu, post-doc UMTC 2009-present
Original developer of gloss vector measure for UMLS::Similarity
Spun off (conceptually at least) from WordNet::Similarity::vector
References
S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pages 136-145, Mexico City, February 2002.
S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805-810, Acapulco, August 2003.
J. Caviedes and J. Cimino. Towards the development of a conceptual distance metric for the UMLS. Journal of Biomedical Informatics, 37(2):77-85, April 2004.
J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings on International Conference on Research in Computational Linguistics, pages 19-33, Taiwan, 1997.
C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 265-283. MIT Press, 1998.
M.E. Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, pages 24-26. ACM Press, 1986.
D. Lin. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, August 1998.
B. McInnes, T. Pedersen, Y. Liu, G. Melton and S. Pakhomov. Knowledge-based Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity. Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association, pages 895-904, Washington, DC, October 2011.
H.A. Nguyen and H. Al-Mubaid. New ontology-based semantic similarity measure for the biomedical domain. In Proceedings of the IEEE International Conference on Granular Computing, pages 623-628, Atlanta, GA, May 2006.
S. Patwardhan, S. Banerjee, and T. Pedersen. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 241-257, Mexico City, February 2003.
S. Patwardhan and T. Pedersen. Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts. In Proceedings of the EACL 2006 Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, pages 1-8, Trento, Italy, April 2006.
T. Pedersen. Rule-based and lightly supervised methods to predict emotions in suicide notes. Biomedical Informatics Insights, 2012:5 (Suppl. 1):185-193, January 2012.
T. Pedersen and V. Kolhatkar. WordNet :: SenseRelate :: AllWords - a broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies 2009 Conference, pages 17-20, Boulder, CO, June 2009.
T. Pedersen, S. Pakhomov, S. Patwardhan, and C. Chute. Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics, 40(3) : 288-299, June 2007.
R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17-30, 1989.
P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448-453, Montreal, August 1995.
H. Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97-123, 1998.
J. Zhong, H. Zhu, J. Li, and Y. Yu. Conceptual graph matching for semantic search. Proceedings of the 10th International Conference on Conceptual Structures, pages 92-106, 2002.
Semantic Similarity
for the Gene Ontology
Various measures supported: http://www.geneontology.org/GO.tools_by_type.semantic_similarity.shtml
Path Based Relatedness
Ontologies include relations other than is-a
These can be used to find shortest paths between concepts e.g., WordNet::Similarity::hso
However, a path made up of different kinds of relations can lead to big semantic jumps
Aspirin treats headaches, which are a symptom of the flu, which can be prevented by a flu vaccine, which is recommended for children ... so aspirin and children are related??
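A minimal Python sketch of the aspirin example as a shortest-path search over a graph with mixed relation types. The relation labels are taken from the sentence above; the search is a plain breadth-first search and is not the hso measure, which also weighs the direction of each link.

```python
from collections import deque

# (head, relation, tail) triples read off the example sentence.
EDGES = [
    ("aspirin", "treats", "headache"),
    ("headache", "symptom-of", "flu"),
    ("flu", "prevented-by", "flu vaccine"),
    ("flu vaccine", "recommended-for", "children"),
]


def shortest_relation_path(start, goal):
    """Breadth-first search; returns a list of (relation, node) steps or None."""
    graph = {}
    for head, relation, tail in EDGES:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, []).append((relation, head))  # allow both directions
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(relation, neighbor)]))
    return None


path = shortest_relation_path("aspirin", "children")
relation_types = {relation for relation, _ in path}
print(len(path), "links,", len(relation_types), "different relation types")
```

A path exists, but every link uses a different relation, which is why a short path of mixed relations can still amount to a large semantic jump.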
Path Based Measure of Relatedness
WebService::UMLSKS::Similarity
HSO implementation for UMLS
Developed at University of Minnesota, Duluth 2010-2012, ongoing
http://search.cpan.org/dist/WebService-UMLSKS-Similarity/