HIKM’2006 AMTEx Automatic Document Indexing in Large Medical Collections Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis Technical University of Crete, Chania, Greece Evangelos E. Milios Dalhousie University, Halifax, Canada

HIKM’2006 AMTEx

Automatic Document Indexing in Large Medical Collections

Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis

Technical University of Crete, Chania, Greece

Evangelos E. Milios

Dalhousie University, Halifax, Canada

Overview

• The need for automatic assignment of index terms in large medical collections

• MMTx (by the US NLM)

• The AMTEx approach to medical document indexing

• AMTEx resources: MeSH & C/NC value

• Experiments & evaluation

• Discussion and future research

Motivation and Objectives

• MeSH is a taxonomy of medical terms

• Subset of UMLS Metathesaurus

• MEDLINE is indexed by MeSH terms (assigned by experts)

• Other medical texts need to be associated with MEDLINE, e.g. consumer medical literature

• Need for automatic assignment of MeSH terms to any medical text

MMTx (MetaMap Transfer)

Maps arbitrary text to UMLS Metathesaurus concepts:

• Parsing to extract noun phrases (syntactic analysis - linguistic filter)

• Variant Generation (uses SPECIALIST Lexicon)

• Candidate Retrieval (mapping process to Metathesaurus Concepts)

• Candidate Evaluation (criteria: centrality, variation, coverage, cohesiveness)

MMTx Example

Parsing

• Shallow syntactic analysis of the input text

• Linguistic filtering: isolates noun phrases

Variant Generation

e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa, …

Candidate Retrieval

Candidate Metathesaurus concepts for the variant “osa”:

osa [osa antigen]
osa [osa gene product]
osa [osa protein]
osa [obstructive sleep apnea]

Candidate Evaluation

Obstructive Sleep apnea   1000
Sleep Apnea               901
Apnea                     827
…                         …
Sleeping                  793
Sleepy                    755

MMTx limitations

• MMTx focuses on UMLS rather than MeSH

But MEDLINE indexing is based on MeSH

• Exhaustive variant generation:

the initial phrase is iteratively expanded into all possible UMLS variants

term overgeneration; term concept diffusion; unrelated terms added to the final candidate list

The AMTEx method

• New method for automatic indexing of medical documents

• Main idea:

Initial term extraction based on a hybrid linguistic/statistical approach, the C/NC value

Extracts general single and multi-word terms

Extracted terms are validated against MeSH

AMTEx Outline

[Flow diagram] INPUT: Document Collection → C/NC value Multi-word Term Extraction & Term Ranking → MeSH Term Validation → Single-word Term Extraction (non-MeSH multi-word terms are broken down & validated against MeSH) → Variant Generation → Term Expansion (MeSH) → OUTPUT: MeSH Term Lists

Resource: MeSH Thesaurus

MeSH: Medical Subject Headings

The NLM medical & biological terms thesaurus:

• Organized in IS-A hierarchies
  – more than 15 taxonomies & more than 22,000 terms
  – a term may appear in multiple taxonomies

• No PART-OF relationships

• Terms organized into synonym sets called entry terms, including stemmed term forms

Fragment of the MeSH IS-A Hierarchy

Root
  Nervous system diseases
    Neurologic manifestations
      pain
        headache
        neuralgia
    Cranial nerve diseases
      Facial neuralgia

The C/NC value method

• Hybrid (linguistic / statistical) term extraction method

• Domain independent

• Specifically designed for the identification of multi-word and nested terms:

compound & multi-word terms very common in biomedical domain

multi-word terms often used in indexing

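C-value

• C-value: a phrase may be a term if it often appears alone or within other candidate terms

$$
\text{C-value}(\alpha) =
\begin{cases}
\log_2|\alpha| \cdot f(\alpha), & \text{if } \alpha \text{ is not nested} \\
\log_2|\alpha| \cdot \left( f(\alpha) - \dfrac{1}{P(T_\alpha)} \sum_{b \in T_\alpha} f(b) \right), & \text{otherwise}
\end{cases}
$$

α: candidate term
|α|: length of α in words
f(α): frequency
Tα: set of candidate terms containing α
P(Tα): number of such terms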
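A minimal sketch of how the C-value could be computed from these quantities, assuming its usual formulation: a candidate that is never nested scores log2|α| · f(α), and a nested one has the average frequency of the longer candidates containing it subtracted from f(α) first. The function names and the toy frequencies are illustrative assumptions, not part of AMTEx.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """freq maps each candidate term (a tuple of words) to its frequency f(a)."""
    scores = {}
    for a, f_a in freq.items():
        # T_a: the longer candidate terms that contain a
        nests = [b for b in freq if len(b) > len(a) and contains(b, a)]
        length_factor = math.log2(len(a))  # 0 for single words: multi-word terms only
        if not nests:
            scores[a] = length_factor * f_a
        else:
            nested_freq = sum(freq[b] for b in nests)
            scores[a] = length_factor * (f_a - nested_freq / len(nests))
    return scores


# Toy frequencies, invented for illustration only.
freq = {
    ("sleep", "apnea"): 10,
    ("obstructive", "sleep", "apnea"): 6,
}
scores = c_value(freq)
```

Here "sleep apnea" is penalised because 6 of its 10 occurrences are inside "obstructive sleep apnea", while the longer term keeps its full frequency and gains from the length factor.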

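NC-value

• NC-value: a phrase is more likely a term if it often appears in a specific word context

$$
\text{weight}(w) = \frac{t(w)}{n}
$$

$$
\text{NC-value}(\alpha) = 0.8 \cdot \text{C-value}(\alpha) + 0.2 \cdot \sum_{w \in C_\alpha} f_\alpha(w) \cdot \text{weight}(w)
$$

w: context word
Cα: set of context words of α
t(w): number of terms w appears with
n: number of all terms
fα(w): frequency of w as context word of α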
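The following sketch shows one way to combine these quantities, assuming the usual NC-value formulation: each context word gets weight t(w)/n, and the NC-value is 0.8 · C-value plus 0.2 times the weighted sum of the context-word frequencies fα(w). The function names, the C-values and the context counts are invented for illustration.

```python
from collections import Counter


def context_weights(term_contexts):
    """weight(w) = t(w) / n, where t(w) counts the terms that w appears with."""
    n = len(term_contexts)
    t = Counter()
    for contexts in term_contexts.values():
        t.update(set(contexts))  # count each term once per context word
    return {w: t[w] / n for w in t}


def nc_value(c_values, term_contexts):
    """Combine the C-value with the context information of each term."""
    weights = context_weights(term_contexts)
    result = {}
    for term, c in c_values.items():
        contexts = term_contexts.get(term, {})
        context_score = sum(f * weights[w] for w, f in contexts.items())
        result[term] = 0.8 * c + 0.2 * context_score
    return result


# Toy inputs: C-values and context-word counts are assumptions, not real data.
c_values = {"sleep apnea": 4.0, "knee pain": 2.0}
term_contexts = {
    "sleep apnea": Counter({"severe": 3, "patient": 1}),
    "knee pain": Counter({"severe": 2}),
}
nc = nc_value(c_values, term_contexts)
```

In this toy input "severe" appears with both terms (weight 1.0) and "patient" with one (weight 0.5), so context words shared by many terms contribute more to the final score.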

AMTEx step 1: C/NC value Multi-word Term Extraction & Ranking

Part-of-Speech Tagging

Linguistic filtering:

• N+ N

• (A|N)+ N

• ( (A|N)+ | ( (A|N)* (N P)? ) (A|N)* ) N

Candidate term ranking based on C/NC-value

Keep terms with NC-value > T1
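The three linguistic filters are regular expressions over part-of-speech tags (N = noun, A = adjective, P = preposition). As a rough sketch, they can be applied by encoding a tagged sentence as a string of one-letter tags and matching against it. The input is assumed to be tagged already; the one-letter tag set, the names `FILTERS` and `candidates`, and the sample sentence are assumptions for illustration. In the third filter the preposition-bearing branch is listed first, because Python's `re` tries alternatives left to right.

```python
import re

# The three filters, written over one-letter tags.
FILTERS = {
    "N+ N": r"N+N",
    "(A|N)+ N": r"[AN]+N",
    "( (A|N)+ | ( (A|N)* (N P)? ) (A|N)* ) N": r"(?:(?:[AN]*(?:NP)?)[AN]*|[AN]+)N",
}


def candidates(tagged, pattern):
    """Return the word sequences whose tag sequence matches `pattern`."""
    # Any tag other than A, N or P is mapped to O (other).
    tags = "".join(tag if tag in {"A", "N", "P"} else "O" for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(pattern, tags)
    ]


# Hand-tagged toy sentence (no real tagger is called here).
tagged = [
    ("severe", "A"), ("knee", "N"), ("pain", "N"), ("in", "P"),
    ("older", "A"), ("adults", "N"), ("was", "O"), ("assessed", "O"),
]

for name, pattern in FILTERS.items():
    print(name, "->", candidates(tagged, pattern))
```

The stricter filters yield shorter, more precise candidates, while the third one also admits phrases joined by a preposition. Note that `re.finditer` returns only non-overlapping matches, so nested candidates would need a separate enumeration step.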

AMTEx step 2: MeSH Term Validation

Candidate terms are validated against the MeSH Thesaurus (simple string matching)

Only candidate terms matching MeSH are kept

Multi-word candidates not matching MeSH may still contain (shorter) MeSH terms
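A minimal sketch of the validation step as simple string matching. Lower-casing and whitespace normalisation are assumptions added here, and the small `mesh_terms` set is a toy stand-in for the MeSH Thesaurus.

```python
def normalize(term):
    """Lower-case and collapse whitespace before comparing strings."""
    return " ".join(term.lower().split())


def validate(candidate_terms, mesh_terms):
    """Split candidates into those matching MeSH and those that do not."""
    mesh = {normalize(t) for t in mesh_terms}
    matched, unmatched = [], []
    for candidate in candidate_terms:
        if normalize(candidate) in mesh:
            matched.append(candidate)
        else:
            unmatched.append(candidate)
    return matched, unmatched


# Toy stand-in for the thesaurus; the real resource holds more than 22,000 terms.
mesh_terms = {"Data Collection", "Pain", "Knee", "Questionnaires"}
matched, unmatched = validate(["data collection", "chronic knee pain"], mesh_terms)
```

The unmatched list is kept rather than discarded, since a multi-word candidate such as "chronic knee pain" may still contain shorter MeSH terms.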

AMTEx step 3: Single-word Term Extraction

For multi-word terms not matching MeSH:

Multi-word terms are split into single-word terms

Single-word terms matched against MeSH

Matched MeSH terms added to term list
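This step can be sketched as follows, assuming whitespace tokenisation and case-insensitive matching. The function name and the toy MeSH set are illustrative only.

```python
def single_word_terms(unmatched_terms, mesh_terms):
    """Break non-MeSH multi-word terms into words and keep the words found in MeSH."""
    mesh = {t.lower() for t in mesh_terms}
    found = []
    for phrase in unmatched_terms:
        for word in phrase.lower().split():
            if word in mesh and word not in found:
                found.append(word)  # keep first-seen order, no duplicates
    return found


# Toy stand-in for the MeSH Thesaurus.
mesh_terms = {"Data Collection", "Pain", "Knee"}
extra_terms = single_word_terms(["chronic knee pain"], mesh_terms)
```

For the candidate "chronic knee pain", the words "knee" and "pain" are recovered while "chronic" is dropped because it is not in the toy thesaurus.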

AMTEx step 4: Term Variant Generation

Variants are added to the list of terms:

• Inflectional variants of the extracted terms identified during term extraction (C/NC-value)

• Stemmed term-forms available in MeSH

AMTEx step 5: Term Expansion

• Each term in the list is expanded with neighbouring terms in MeSH hierarchy

• The expansion may include terms more than one level higher or lower than the original term, depending on similarity threshold T

• Semantic similarity metric by Li et al.

Y. Li, Z. A. Bandar, and D. McLean. An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Trans. on Knowledge and Data Engineering, 15(4):871–882, July/Aug. 2003.
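A sketch of threshold-based expansion, assuming the commonly cited form of the Li et al. measure, sim = e^(−αl) · tanh(βh), where l is the shortest path length between two terms and h is the depth of their lowest common subsumer, with α = 0.2 and β = 0.6. The toy hierarchy follows the MeSH fragment shown earlier, under the assumption that each term has a single parent (MeSH itself allows several); the depth convention (root = 0) and all function names are also assumptions.

```python
import math

# Toy IS-A fragment: child -> parent (single parent assumed for simplicity).
PARENT = {
    "nervous system diseases": "root",
    "neurologic manifestations": "nervous system diseases",
    "pain": "neurologic manifestations",
    "headache": "pain",
    "neuralgia": "pain",
    "cranial nerve diseases": "nervous system diseases",
    "facial neuralgia": "cranial nerve diseases",
}


def path_to_root(term):
    """List of terms from `term` up to the root, inclusive."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def li_similarity(a, b, alpha=0.2, beta=0.6):
    """sim = exp(-alpha * l) * tanh(beta * h)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    subsumer = next(t for t in path_a if t in path_b)  # lowest common subsumer
    l = path_a.index(subsumer) + path_b.index(subsumer)  # shortest path length
    h = len(path_to_root(subsumer)) - 1                  # depth, root = 0
    return math.exp(-alpha * l) * math.tanh(beta * h)


def expand(term, threshold):
    """Neighbouring terms whose similarity to `term` reaches the threshold T."""
    return sorted(
        t for t in PARENT
        if t != term and li_similarity(term, t) >= threshold
    )


print(expand("pain", 0.7))
print(expand("pain", 0.6))
```

With T = 0.7 only the direct children of "pain" are added; lowering T to 0.6 also pulls in its parent, which illustrates how a lower threshold widens the expansion.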

Example

Input: Full text article

MEDLINE index terms: “Aged”, “Data Collection”, “Humans”, “Knee”, “Middle Aged”, “Osteoarthritis, Knee/complications”, “Osteoarthritis, Knee/diagnosis”, “Pain/classification”, “Pain/etiology”, “Prospective Studies”, “Research Support, Non-U.S. Gov’t”

MMTx terms: “osteoarthritis knee”, “retention”, “peat”, “rheumatology”, “acetylcholine”, “lysine acetate”, “potassium acetate”, “questionnaires”, “target population”, “population”, “selection bias”, “creativeness”, “reproduction”, “cohort studies”, “europe”, “couples”, “naloxone”, “sample size”, “arthritis”, “data collection”, “mail”, “health status”, “respondents”, “ontario”, “universities”, “dna”, “baseline survey”, “medical records”, “informatics”, “general practitioners”, “gender”, “beliefs”, “logistic regression”, “female”, “marital status”, “employment status”, “comprehension”, “surveys”, “age distribution”, “manual”, “occupations”, “manuals”, “persons”, “females”, “minor”, “minority groups”, “incentives”, “business”, “ability”, “comparative study”, “odds ratio”, “biomedical research”, “pubmed”, “copyright”, “coding”, “longitudinal studies”, “immunoelectrophoresis”, “skin diseases”, “government”, “norepinephrine”, “social sciences”, “survey methods”, “tyrosine”, “new zealand”, “azauridine”, “gold”, “nonrespondents”, “cycloheximide”, “rheum”, “jordan”, “cadmium”, “radiopharmaceuticals”, “community”, “disease progression”, “history”

AMTEx terms: “health surveys”, “pain”, “review publication type”, “data collection”, “osteoarthritis knee”, “knee”, “science”, “health services needs and demand”, “population”, “research”, “questionnaires”, “informatics”, “health”

Evaluation

Precision and Recall measures

Dataset:

• 61 full MEDLINE documents (not abstracts), from the PMC database of NCBI PubMed

• MEDLINE documents are paired to respective MeSH index terms, manually assigned by experts

Ground Truth:

• the set of MeSH document index terms

Benchmark method:

• MMTx against our AMTEx
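For a single document, precision and recall against the expert-assigned index terms can be computed as in this sketch. How the scores are aggregated over the 61 documents is not stated here, so only the per-document case is shown; the function name and the term sets are illustrative, not taken from the experiments.

```python
def precision_recall(predicted_terms, ground_truth_terms):
    """Precision and recall of one document's predicted index terms."""
    predicted = set(predicted_terms)
    truth = set(ground_truth_terms)
    hits = len(predicted & truth)  # terms both predicted and assigned by experts
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Toy term sets, invented for illustration.
predicted = {"pain", "knee", "health", "data collection"}
truth = {"aged", "knee", "data collection", "humans", "middle aged"}
precision, recall = precision_recall(predicted, truth)
```

Two of the four predicted terms are correct (precision 0.5), and they cover two of the five expert terms (recall 0.4).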

Multi-Word Terms only

Method             Precision   Recall
MMTx               0.013       0.015
AMTEx (T = 0.5)    0.186       0.108
AMTEx (T = 0.6)    0.218       0.090
AMTEx (T = 0.7)    0.236       0.072
AMTEx (T = 0.8)    0.236       0.072
AMTEx (T = 0.9)    0.236       0.070

T: term expansion threshold, lower T means further expansion

Contribution of Single-Word Terms

Method                            Precision   Recall
MMTx                              0.013       0.015
AMTEx                             0.236       0.070
AMTEx & single-word MeSH terms    0.120       0.228

Conclusions: AMTEx

Designed for indexing and retrieval of MEDLINE documents

Focuses on multi-word term extraction using valid linguistic & statistical criteria

Based on MeSH -- similarly to human indexing

Selectively expands into term variants, synonyms

Outperforms the current benchmark MMTx method, in both precision & recall

Future Work

• Better ranking of terms, using semantic similarity

• Learning of thresholds T1, T

• Word sense disambiguation to detect the correct sense for expansion rather than the most common sense

• Handling shorter documents