Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics...


Page 1: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Ontology-based Annotation & Query of TMA data

Nigam Shah

Stanford Medical Informatics (nigam@stanford.edu)

Page 2: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Tissue Microarrays

www.nature.com/clinicalpractice/onc

Page 3: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Stanford tissue microarray database

http://tma.stanford.edu/tma_portal/

Page 4: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Key analysis issue

Tissue microarrays query a large number of samples/patients for one protein.

The key query dimension in TMA data is a tissue sample.

Because there is no commonly used ontology to describe the diagnosis [or annotations] for a given TMA sample in TMAD, it is not easy to perform such a query.

Page 5: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Ontologies considered

The NCI Thesaurus, version 05.09g

The SNOMED-CT, from UMLS 2005 AA

Page 6: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Available annotations for a block

Each donor block in the TMA has semi-structured text associated with it.

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Subclass 4
2334  Ovary        MMMT
3335  Prostate     Carcinoma  Adeno              intraductal
7022  Bladder      Carcinoma  Transitional cell  In situ
7288  Testis       teratoma   immature           Embryonal carcinoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid
6663  lung         Sarcoma    Leiomyo            epithelioid
4713  stomach      carcinoma  unknown

Page 7: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Map text to ontology terms

Make all possible permutations
Rules to weed out bad permutations

Check for an exact match with NCI and SNOMED-CT terms (and/or synonyms)
Rules to weed out bad matches

Example: Prostate Carcinoma Adeno intraductal (24 permutations)

Prostate Carcinoma Adeno intraductal
Carcinoma Prostate intraductal Adeno
Adeno Carcinoma intraductal Prostate
Prostate intraductal Adeno Carcinoma

Match: Prostate_Ductal_Adenocarcinoma
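The permutation-and-lookup steps on this slide can be sketched in Python as follows. This is a minimal illustration only: the synonym table, the `normalize` helper, and the choice to ignore case and spaces when comparing are assumptions made for the example, not the rules actually used for TMAD.

```python
from itertools import permutations


def normalize(label: str) -> str:
    """Lowercase and keep only letters/digits, so 'Adeno Carcinoma' equals 'adenocarcinoma'."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


# Hypothetical synonym table (normalized synonym -> ontology term).
# A real table would be built from the NCI Thesaurus and SNOMED-CT synonym lists.
SYNONYMS = {
    normalize("Prostate Ductal Adenocarcinoma"): "Prostate_Ductal_Adenocarcinoma",
    normalize("Prostate Intraductal Adenocarcinoma"): "Prostate_Ductal_Adenocarcinoma",
}


def candidate_strings(words):
    """Return every ordering of the annotation words as a single string."""
    return [" ".join(p) for p in permutations(words)]


def map_term_set(words, synonyms=SYNONYMS):
    """Return the ontology terms whose synonyms exactly match some permutation."""
    matches = set()
    for candidate in candidate_strings(words):
        term = synonyms.get(normalize(candidate))
        if term is not None:
            matches.add(term)
    return sorted(matches)


words = ["Prostate", "Carcinoma", "Adeno", "intraductal"]
print(len(candidate_strings(words)))  # 4! orderings of four words
print(map_term_set(words))
```

The sketch only permutes the full word list. The rules for weeding out bad permutations and bad matches are not described on the slide, so they are left out.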

Page 8: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Sample matches (from NCI-T)

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Ontology Terms
2334  Ovary        MMMT                                                                Malignant_Mixed_Mesodermal_Mullerian_Tumor
3335  Prostate     Carcinoma  Adeno              intraductal                           Prostate_Ductal_Adenocarcinoma
7022  Bladder      Carcinoma  Transitional cell  In situ                               Stage_0_Transitional_Cell_Carcinoma
                                                                                       Transitional_Cell_Carcinoma
                                                                                       Bladder_Carcinoma
                                                                                       Carcinoma_in_situ
7288  Testis       teratoma   immature           Embryonal carcinoma                   Immature|Teratoma
                                                                                       Testicular_Embryonal_Carcinoma
                                                                                       Immature_Teratoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis  Hepatocellular_Carcinoma
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid                           Soft_Tissue_Sarcoma
                                                                                       Leiomyosarcoma
                                                                                       Epithelioid_Sarcoma
6663  lung         Sarcoma    Leiomyo            epithelioid                           Lung_Sarcoma
                                                                                       Leiomyosarcoma
                                                                                       Epithelioid_Sarcoma
4713  stomach      carcinoma  unknown                                                  Gastric_carcinoma

Page 9: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Results and validation

Mapped the term-sets for 8495 records, which correspond to 783 distinct term-sets.
577 term-sets (6614 records) matched to the NCI Thesaurus.
365 term-sets (3465 records) matched to SNOMED-CT.

In total, mapped 6871 records (80% of the annotated records in TMAD; 641 distinct term-sets) to one or more ontology terms.

Validation

             NCI                         SNOMED-CT
             Appropriate  Inappropriate  Appropriate  Inappropriate
Set-1        41           9              41           9
Set-2        42           8              43           7
Set-3        46           4              38           12
Total        129          21             122          28
Average (%)  43.0 (86%)   7.0 (14%)      40.66 (81%)  9.33 (19%)
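The totals and averages in the validation table can be recomputed from the per-set counts. The short sketch below does that; the dictionary layout and the `summarize` helper are illustrative names, and only the counts themselves come from the slide.

```python
# (appropriate, inappropriate) counts for Set-1..Set-3, copied from the validation table.
VALIDATION = {
    "NCI": [(41, 9), (42, 8), (46, 4)],
    "SNOMED-CT": [(41, 9), (43, 7), (38, 12)],
}


def summarize(sets):
    """Return totals, per-set means, and the overall fraction judged appropriate."""
    appropriate = sum(a for a, _ in sets)
    inappropriate = sum(i for _, i in sets)
    return {
        "appropriate": appropriate,
        "inappropriate": inappropriate,
        "mean_appropriate": appropriate / len(sets),
        "mean_inappropriate": inappropriate / len(sets),
        "fraction_appropriate": appropriate / (appropriate + inappropriate),
    }


for ontology, sets in VALIDATION.items():
    s = summarize(sets)
    print(f"{ontology}: {s['appropriate']} appropriate, "
          f"{s['inappropriate']} inappropriate, "
          f"{s['fraction_appropriate']:.0%} appropriate overall")
```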

Page 10: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Browsing interface

Page 11: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Parent & sibling nodes with data (burlywood)

Child nodes with data (Yellow)

Child nodes with no data (Grey)

Page 12: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Click on the “anchor” link to get data

Page 13: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Updates since February

                                    2/17/2006  9/23/2006
Donor blocks to match               8495       8518
Donor blocks with NCI match         6614       7162
Donor blocks with SNOMEDCT match    3465       6959
Donor blocks with any match         6871       7399
Donor blocks with both match        3208       6722

                                    2/17/2006  9/23/2006
Distinct Terms                      783        791
Distinct Terms with NCI match       577        610
Distinct Terms with SNOMEDCT match  365        610
Distinct Terms with any match       641        651
Distinct Terms with both match      295        569

[Bar charts: distinct terms and donor blocks in each match category, 2/17/2006 vs 9/23/2006]
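As a quick consistency check on the donor-block counts on this slide, the "any match" figure should equal NCI matches plus SNOMEDCT matches minus blocks matched by both. The sketch below checks that and computes the share of donor blocks with any match; the dictionary keys and function names are illustrative.

```python
# Donor-block counts copied from the slide, keyed by snapshot date.
DONOR_BLOCKS = {
    "2/17/2006": {"total": 8495, "nci": 6614, "snomed": 3465, "any": 6871, "both": 3208},
    "9/23/2006": {"total": 8518, "nci": 7162, "snomed": 6959, "any": 7399, "both": 6722},
}


def union_from_parts(counts):
    """Inclusion-exclusion: |NCI or SNOMEDCT| = |NCI| + |SNOMEDCT| - |both|."""
    return counts["nci"] + counts["snomed"] - counts["both"]


def coverage(counts):
    """Fraction of donor blocks mapped to at least one ontology term."""
    return counts["any"] / counts["total"]


for date, counts in DONOR_BLOCKS.items():
    consistent = union_from_parts(counts) == counts["any"]
    print(f"{date}: coverage {coverage(counts):.1%}, counts consistent: {consistent}")
```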

Page 14: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

How do ontology-based annotations help?

Better search: we can retrieve all samples annotated under a broader term, for example all retroperitoneal tumors or all malignant uterine neoplasms.

Better integration of data: we can correlate gene expression with protein expression across multiple tumor types.

Tissue microarray data from TMAD
Gene expression data from GEO
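The "better search" point amounts to expanding a query term to all of its descendants in the ontology and collecting the samples annotated with any of them. The sketch below illustrates the idea. The small hierarchy is hypothetical and is not copied from the NCI Thesaurus or SNOMED-CT; the block IDs and term names are borrowed from the sample-matches slide.

```python
from collections import deque

# Hypothetical parent -> children hierarchy, for illustration only.
CHILDREN = {
    "Carcinoma": ["Bladder_Carcinoma", "Gastric_carcinoma", "Hepatocellular_Carcinoma"],
    "Bladder_Carcinoma": ["Transitional_Cell_Carcinoma"],
    "Sarcoma": ["Leiomyosarcoma", "Epithelioid_Sarcoma"],
}

# Ontology term -> donor block IDs annotated with that term.
ANNOTATIONS = {
    "Bladder_Carcinoma": {7022},
    "Transitional_Cell_Carcinoma": {7022},
    "Gastric_carcinoma": {4713},
    "Hepatocellular_Carcinoma": {8060},
    "Leiomyosarcoma": {6662, 6663},
    "Epithelioid_Sarcoma": {6662, 6663},
}


def descendants(term, children=CHILDREN):
    """Return the term plus every term below it (breadth-first traversal)."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def samples_for(term, children=CHILDREN, annotations=ANNOTATIONS):
    """Collect donor blocks annotated with the term or any of its descendants."""
    blocks = set()
    for t in descendants(term, children):
        blocks |= annotations.get(t, set())
    return blocks


print(sorted(samples_for("Carcinoma")))
print(sorted(samples_for("Sarcoma")))
```

A query for a broad term then returns blocks that were annotated only with more specific terms, which a plain text search on the original annotations would miss.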

Page 15: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Integrating mRNA and protein expression

[Figure: two data matrices, Proteins x Samples and Genes x Samples]
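One way to read this slide: once samples in both resources carry ontology terms, the protein-by-sample and gene-by-sample data can be lined up through the terms they share. The sketch below shows only that pairing step. The GEO sample names and their annotations are invented for illustration, and no expression values are included.

```python
from collections import defaultdict

# TMA donor blocks and their ontology terms (IDs and terms from the sample-matches slide).
TMA_BLOCKS = {
    7022: {"Bladder_Carcinoma"},
    8060: {"Hepatocellular_Carcinoma"},
    4713: {"Gastric_carcinoma"},
}

# Hypothetical gene-expression samples with ontology annotations (not real GEO records).
GEO_SAMPLES = {
    "geo_sample_A": {"Hepatocellular_Carcinoma"},
    "geo_sample_B": {"Bladder_Carcinoma"},
    "geo_sample_C": {"Lung_Sarcoma"},
}


def index_by_term(annotations):
    """Invert sample -> terms into term -> set of samples."""
    index = defaultdict(set)
    for sample_id, terms in annotations.items():
        for term in terms:
            index[term].add(sample_id)
    return index


def shared_terms(tma, geo):
    """For each term present in both resources, return (TMA blocks, GEO samples)."""
    tma_index = index_by_term(tma)
    geo_index = index_by_term(geo)
    common = tma_index.keys() & geo_index.keys()
    return {term: (tma_index[term], geo_index[term]) for term in common}


for term, (blocks, samples) in sorted(shared_terms(TMA_BLOCKS, GEO_SAMPLES).items()):
    print(term, sorted(blocks), sorted(samples))
```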

Page 16: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Partial alignment of NCI-T and SNOMED-CT as a “bonus”

Page 17: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Steps in Alignment

Anchor identification: identify similar class labels in the ontologies to be aligned. Usually done by string matching.

Ontology structure: use the “similar” classes as anchors and examine the local [graph] structure around them to inform the “similarity” metric.

[Figure: two example ontology graphs, one with nodes Root and Term-1 to Term-5, the other with nodes R and t1 to t7]
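The anchor identification step (string matching on class labels) can be sketched as below. The normalization rule and both label lists are assumptions made for illustration; in particular, the second list is made-up spelling variants, not labels copied from SNOMED-CT.

```python
def normalize(label: str) -> str:
    """Lowercase, turn underscores into spaces, and collapse repeated whitespace."""
    return " ".join(label.replace("_", " ").lower().split())


def find_anchors(labels_a, labels_b):
    """Pair labels from the two ontologies whose normalized forms are identical."""
    index_b = {normalize(label): label for label in labels_b}
    anchors = []
    for label in labels_a:
        key = normalize(label)
        if key in index_b:
            anchors.append((label, index_b[key]))
    return anchors


# Hypothetical label lists for two ontologies.
LABELS_A = ["Hepatocellular_Carcinoma", "Leiomyosarcoma", "Lung_Sarcoma"]
LABELS_B = ["Hepatocellular carcinoma", "Leiomyosarcoma", "Sarcoma of lung"]

print(find_anchors(LABELS_A, LABELS_B))
```

Labels that differ in word order, such as the last pair, are not matched by this step, which is where the structure-based step comes in.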

Page 18: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

We might improve alignment …

Ontology [graph] structure based step

Provide Anchors from annotated data

[Figure: the two example ontology graphs (Root, Term-1 to Term-5; R, t1 to t7) with anchor pairs Term-2/t1 and Term-5/t5; S2 is linked to both t5 and Term-5]
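The idea of providing anchors from annotated data can be sketched as follows: if the same term-set is mapped to a term in each ontology, that pair of terms becomes a candidate anchor. Everything here is an illustrative assumption, including the `min_support` threshold and the S1 and S3 entries. Only the S2, Term-5, t5 link appears on the slide.

```python
from collections import Counter
from itertools import product

# Term-set -> matched terms in each ontology (S1 and S3 are hypothetical).
MATCHES_A = {"S1": {"Term-2"}, "S2": {"Term-5"}, "S3": {"Term-5"}}
MATCHES_B = {"S1": {"t1"}, "S2": {"t5"}, "S3": {"t5"}}


def candidate_anchors(matches_a, matches_b, min_support=1):
    """Count how many term-sets link each (term_a, term_b) pair and keep well-supported pairs."""
    support = Counter()
    for term_set in matches_a.keys() & matches_b.keys():
        for pair in product(sorted(matches_a[term_set]), sorted(matches_b[term_set])):
            support[pair] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


print(candidate_anchors(MATCHES_A, MATCHES_B))
print(candidate_anchors(MATCHES_A, MATCHES_B, min_support=2))
```

Pairs found this way could be passed to the structure-based step alongside the anchors found by string matching.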

Page 19: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Better Text-mapping, Better Alignment

                             2/17/2006  7/23/2006
Distinct Terms               783        791
Terms with NCI match         577        620
Terms with SNOMEDCT match    365        610
Terms with any match         641        654
Terms with both match        295        576

[Bar chart: distinct terms in each match category, 2/17/2006 vs 7/23/2006]

Page 20: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Summary

Ability to map word-groups to ontology terms

[Figures repeated from earlier slides: Proteins x Samples and Genes x Samples matrices; the two example ontology graphs with anchor pairs Term-2/t1 and Term-5/t5, labelled "Ontology [graph] structure based step" and "Provide Anchors from annotated data"]

Page 21: Ontology-based Annotation & Query of TMA data Nigam Shah Stanford Medical Informatics (nigam@stanford.edu)

Credits and acknowledgements

Pathology: Robert Marinelli, Matt van de Rijn

Medical Informatics: Kaustubh Supekar, Daniel Rubin, Mark Musen

Funding: NIH