Resolving abbreviations to their senses in Medline
S. Gaudan, H. Kirsch and D. Rebholz-Schuhmann
European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge, UK
(Bioinformatics, Vol. 21, no. 18, 2005, pp. 3658-3664)
Abstract
Abbreviation resolution improves the accuracy of document retrieval engines and of information extraction (IE) systems.
The authors combine an automatic analysis of Medline abstracts with linguistic methods to build a dictionary of abbreviation/sense pairs.
Ambiguous global abbreviations are resolved using support vector machines (SVMs).
The system disambiguates abbreviations with a precision of 98.9% for a recall of 98.2%.
1. Introduction (1/2)
Global abbreviations appear in documents without the long form explicitly stated, whereas local abbreviations come together with their long form in the document.
Common abbreviations become accepted as synonyms and represent important terms in their domain, whereas dynamic abbreviations are defined for convenience in only a particular paper.
The most problematic step in abbreviation resolution is retrieving the sense of a global abbreviation that is ambiguous.
1. Introduction (2/2)
Disambiguation schema
1. A lexicon is used for collecting the abbreviations and their senses.
2. Then the method computes the context of use for each sense.
3. Finally, a machine-learning algorithm is trained on the context of each sense.
4. To disambiguate an abbreviation in a document, its context in that document is computed.
5. The machine-learning algorithm then returns the most probable sense of the abbreviation given that context.
2. Dictionary of Abbreviations
Scanning all Medline abstracts available in August 2004 yielded 5,250,259 long-form/abbreviation pairs in 2,857,954 Medline abstracts (this set of abstracts is referred to as D).
Example of long-form/abbreviation pairs:
The changes in adrenocorticotropin hormone (ACTH), cortisol and dehydroepiandrosterone (DHEA) in maternal and fetal plasma were estimated in two groups of women.
2.1 Merging morphologically similar long forms (1/2)
e.g. ‘oestrogen receptor’ versus ‘estrogen-receptor’.
An n-gram similarity algorithm is used with a cut-off parameter (0.8) to merge similar long forms l1 and l2:
\[
\mathrm{similarity}(l_1, l_2, n) = \frac{|\mathrm{grams}_n(l_1) \cap \mathrm{grams}_n(l_2)|}{\sqrt{|\mathrm{grams}_n(l_1)| \cdot |\mathrm{grams}_n(l_2)|}}
\]
E.g., grams_3(‘hello’) = {‘hel’, ‘ell’, ‘llo’}
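A minimal Python sketch of the n-gram merging test. It assumes that the similarity is the number of shared n-grams divided by the square root of the product of the two n-gram set sizes, that long forms are lower-cased before comparison, and that a pair is merged when the score reaches the 0.8 cut-off. The names grams, ngram_similarity and should_merge are illustrative and do not come from the paper.

```python
from math import sqrt


def grams(text, n=3):
    """Return the set of character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(l1, l2, n=3):
    """Shared n-grams, normalised by sqrt(|grams(l1)| * |grams(l2)|) (assumed form)."""
    g1, g2 = grams(l1, n), grams(l2, n)
    if not g1 or not g2:
        # Strings shorter than n have no n-grams; treat them as dissimilar.
        return 0.0
    return len(g1 & g2) / sqrt(len(g1) * len(g2))


def should_merge(l1, l2, n=3, cutoff=0.8):
    """Assumption: lower-case both long forms and merge when score >= cutoff."""
    return ngram_similarity(l1.lower(), l2.lower(), n) >= cutoff


print(sorted(grams("hello", 3)))
print(should_merge("Computed radiography", "Computed radiographic"))
```

The n-gram size used by the authors is not stated on the slide; n=3 is taken from the ‘hello’ example only.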
2.1 Merging morphologically similar long forms (2/2)
Long form 1 | Long form 2
Computed radiography | Computed radiographic
Compression ratios | Compression rate
Caloric restriction | Calorie-restricted
Thrombocytopenia with absent radii | Thrombocytopenia and absent radius
Transactivator responsive element | Trans-activator response element
2.2 Context based merging (1/3)
Some long forms can be morphologically quite different (e.g. ‘beta site APP-cleaving enzyme’ versus ‘beta site amyloid precursor protein-cleaving enzyme’) but still code for the same meaning.
The similarity between two sets of long forms (g1 and g2) is computed by considering the number of common words in the sets Dg1 and Dg2 of documents containing the long forms, normalized by the total number of words in the documents of the two sets:
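2.2 Context based merging (2/3)
\[
\mathrm{similarity}(g_1, g_2) = \frac{c(g_1, g_2)}{\sqrt{c(g_1, g_1) \cdot c(g_2, g_2)}}
\]
with
\[
c(g_1, g_2) = \frac{1}{|D_{g_1}| \, |D_{g_2}|} \sum_{d_i \in D_{g_1},\ d_j \in D_{g_2},\ d_i \neq d_j} \frac{2 \, |W(d_i) \cap W(d_j)|}{|W(d_i)| + |W(d_j)|}
\]
if |D_{g_1}| > 1 and |D_{g_2}| > 1, where W(d) is the set of words in the document d.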
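A rough Python sketch of the pairwise term c(g1, g2) only. It assumes that each document is represented by its set of words W(d), that two documents are compared with the overlap 2|W(di) ∩ W(dj)| / (|W(di)| + |W(dj)|), that pairs made of the same document are skipped, and that the sum is divided by |Dg1| · |Dg2|. Documents are keyed by hypothetical integer ids so that "same document" can be checked; dice_overlap and c are illustrative names.

```python
def dice_overlap(words_a, words_b):
    """2 * |A ∩ B| / (|A| + |B|) for two sets of words."""
    total = len(words_a) + len(words_b)
    if total == 0:
        return 0.0
    return 2 * len(words_a & words_b) / total


def c(docs_1, docs_2):
    """Average word overlap between two document sets.

    docs_1, docs_2: dicts mapping a document id to its set of words.
    Pairs with identical ids (the same document) are skipped, as assumed above.
    """
    if not docs_1 or not docs_2:
        return 0.0
    pair_scores = [
        dice_overlap(words_i, words_j)
        for id_i, words_i in docs_1.items()
        for id_j, words_j in docs_2.items()
        if id_i != id_j
    ]
    return sum(pair_scores) / (len(docs_1) * len(docs_2))


# Toy data: ids and words are made up for illustration.
g1 = {1: {"amyloid", "enzyme"}}
g2 = {2: {"enzyme", "cleaving"}, 3: {"amyloid", "enzyme"}}
print(c(g1, g2))
```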
2.2 Context based merging (3/3)
Long form 1 | Long form 2
Alpha-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid receptors | AMPA receptors
Silver-stained nucleolar organizer regions | Argyrophilic staining of nucleolar organizer regions
Complete remission | Complete response
3. Disambiguation of Abbreviations
Whenever no long form associated with an ambiguous abbreviation is found in the document, the context is used to identify the correct meaning of the abbreviation.
Which context words should be extracted to disambiguate abbreviations?
How is the classifier trained?
3.1 Context extraction
C-value algorithm (Frantzi and Ananiadou, 1999)
Extract a tuple of size n (55 on average) of relevant words for every document.
\[
\text{C-value}(w) = \log(|w|) \left( f(w) - \frac{1}{|T_w|} \sum_{v \in T_w} f(v) \right)
\]
where w is the adjective-noun pattern candidate, |w| is the length (in words) of w, f(w) is the frequency of w in the corpus and T_w is the set of adjective-noun patterns contained in the candidate w.
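A small Python sketch of the C-value score for one candidate, following the definition on this slide: log of the candidate length times its frequency minus the mean frequency of the patterns in T_w. Two points are assumptions: the logarithm is taken in base 2, and an empty T_w subtracts nothing. The function name, the argument layout and the toy frequencies are hypothetical.

```python
from math import log2


def c_value(candidate, freq, related):
    """C-value of `candidate`.

    candidate: tuple of words (an adjective-noun pattern).
    freq: dict mapping a pattern to its corpus frequency f(.).
    related: list of patterns forming T_w for this candidate.
    """
    length_weight = log2(len(candidate))  # assumption: base-2 logarithm
    if related:
        penalty = sum(freq[v] for v in related) / len(related)
    else:
        penalty = 0.0  # assumption: nothing to subtract when T_w is empty
    return length_weight * (freq[candidate] - penalty)


# Toy frequencies, made up for illustration.
w = ("beta", "site", "cleaving", "enzyme")
v1 = ("cleaving", "enzyme")
v2 = ("beta", "site")
freq = {w: 10, v1: 4, v2: 2}
print(c_value(w, freq, [v1, v2]))
```

Note that a one-word candidate always scores 0 under this formula, because log(1) = 0.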
3.2 The model (1/2)
S(a) is the set of senses for each abbreviation a in the dictionary.
Each sense s ∈ S(a) is illustrated by a set of documents D_s ⊂ D.
D_s is the set of documents containing the abbreviation/long-form pairs.
For each document d, the context words are extracted and the document is described by a vector v = g(d) with g: D → {0, 1}^n.
3.2 The model (2/2)
The ith component of v, v_i, is defined as
\[
v_i = \begin{cases} 1 & \text{if the word } w_i \text{ appears in the document } d, \\ 0 & \text{otherwise.} \end{cases}
\]
The function Γ associates with each sense s a set of vectors Γ(s) as follows:
\[
\Gamma(s) = \{\, g(d) \mid d \in D_s \,\}
\]
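The two definitions above can be sketched in Python as follows. The sketch assumes a fixed, ordered vocabulary w_1 ... w_n of context words and documents given as lists of already extracted context words. The names g and vectors_for_sense, and the example vocabulary, are illustrative.

```python
def g(document_words, vocabulary):
    """Binary vector: component i is 1 if vocabulary[i] appears in the document."""
    present = set(document_words)
    return [1 if word in present else 0 for word in vocabulary]


def vectors_for_sense(documents, vocabulary):
    """Vectors g(d) for all documents d in D_s (one sense)."""
    return [g(d, vocabulary) for d in documents]


# Made-up vocabulary and documents, for illustration only.
vocabulary = ["protein", "kinase", "cell"]
d_s = [["cell", "protein", "cell"], ["kinase"]]
print(vectors_for_sense(d_s, vocabulary))
```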
3.3 Disambiguation (1/2)
This problem can be described as a classification problem: assign g(d) to one of the classes represented by the vector sets Γ(s) = {g(d) | d ∈ D_s}, where s ∈ S(a). This is handled with SVMs.
For each sense s of an abbreviation a:
The positive class is C+(s) = Γ(s).
The negative class is the set of vectors of all other senses of a:
\[
C^{-}(s) = \bigcup_{s' \in S(a),\ s' \neq s} \Gamma(s')
\]
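3.3 Disambiguation (2/2)
An SVM is created for each sense s and trained with C+(s) and C-(s).
The result is a function h_s: {0, 1}^n → R.
The sense prediction:
\[
h_s(g(d)) \begin{cases} > 0 & \text{predicts } g(d) \in C^{+}(s), \\ < 0 & \text{predicts } g(d) \in C^{-}(s). \end{cases}
\]
\[
\mathrm{sense}(a, d) = s \ \text{ if and only if } \ \forall s' \in S(a):\ h_s(g(d)) \ge h_{s'}(g(d)) \ \text{ and } \ h_s(g(d)) > 0.
\]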
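A short Python sketch of the decision rule only, assuming the per-sense SVM outputs h_s(g(d)) have already been computed: the predicted sense is the one with the highest output, provided that output is positive. The function name, the None return value for "no sense predicted" and the example senses for ‘PC’ are hypothetical. Ties are broken by dictionary order, which is a choice of this sketch.

```python
def predict_sense(scores):
    """Pick the sense with the highest SVM output, if that output is > 0.

    scores: dict mapping each sense s in S(a) to h_s(g(d)).
    Returns the chosen sense, or None when no classifier is positive.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Illustrative scores; the senses listed for 'PC' are made up.
scores = {"personal computer": -0.4, "phosphatidylcholine": 1.2, "prostate cancer": 0.3}
print(predict_sense(scores))
```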
4. Abbreviation Resolution
If a long form is found in the text, its most frequent form is kept.
If no long form can be retrieved from the document, then a look-up of the abbreviation in the dictionary is performed.
If only one sense is found, then the abbreviation is not ambiguous and the most frequent long form is kept.
If several senses are retrieved, then the disambiguation process is applied.
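The resolution steps above can be sketched in Python as one function. This is only an outline under stated assumptions: the dictionary maps an abbreviation to a list of senses, each sense is stored as its most frequent long form, a separate mapping gives the most frequent form of each variant, and the disambiguation step is passed in as a callable. All names and example entries, including which variant is the most frequent, are hypothetical. The case of an abbreviation missing from the dictionary is not covered by the slides and returns None here.

```python
def resolve(abbreviation, local_long_form, senses_by_abbreviation,
            most_frequent_form, disambiguate):
    """Return the resolved long form of `abbreviation`, or None if unknown."""
    # 1. Local abbreviation: keep the most frequent form of the long form found.
    if local_long_form is not None:
        return most_frequent_form.get(local_long_form, local_long_form)
    # 2. Global abbreviation: look it up in the dictionary.
    senses = senses_by_abbreviation.get(abbreviation, [])
    if not senses:
        return None  # assumption: case not described in the slides
    # 3. Only one sense: the abbreviation is not ambiguous.
    if len(senses) == 1:
        return senses[0]
    # 4. Several senses: apply the disambiguation process.
    return disambiguate(abbreviation, senses)


# Illustrative dictionary content.
senses_by_abbreviation = {
    "ACTH": ["adrenocorticotropin hormone"],
    "PC": ["sense one", "sense two"],
}
most_frequent_form = {"estrogen-receptor": "oestrogen receptor"}


def pick_second(abbreviation, senses):
    """Stand-in for the SVM-based disambiguation step."""
    return senses[1]


print(resolve("ACTH", None, senses_by_abbreviation, most_frequent_form, pick_second))
```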
5.2 Results: disambiguation (1/2)
The disambiguation is required for abbreviations having several senses and occurring without the long form.
Considering abbreviations occurring 40 documents, there are 7806 abbreviations with 12330 senses. Out of these 7806 abbreviations, 1851 are polysemic, having on average 3.4 senses with a maximum of 32 senses for ‘PC’.
5.2 Results: disambiguation (2/2)
The SVMs were trained and tested using a 5-fold cross-validation schema.
In order to avoid the explicit indication of the sense, the abbreviation long forms are removed from the text before the SVMs learn or classify the test documents.
The system achieves a precision of 98.9% for a recall of 98.2% (98.5% accuracy).
6. Discussion (1/3)
The dictionary of abbreviations, the context extraction and the disambiguation module are the three main components.
Dictionary
The dictionary has been generated from Medline so that its content is most suitable for abbreviation resolution in biomedical text.
The high quality of the dictionary is crucial to achieve the resolution of abbreviations with a high precision/recall.
6. Discussion (2/3)
Context extraction
The context extraction is based on the text itself and not based on human annotations, unlike MeSH terms.
The context of a sense is represented with vectors that have on average 3000 non-empty features. In other words, each sense is represented with a considerable number of words.
6. Discussion (3/3)
Disambiguation module
The accuracy of the disambiguation method benefits from the high performance of SVMs, which have been successfully used in many text classification tasks.
Disambiguation of abbreviations is more accurate than word sense disambiguation on English words because the senses of an abbreviation are on average more distant from one another.